Deciduous Tree Leaf Identification with Artificial Neural Networks - Year 2. By Patrick Thomas, mentor Rick Fisher, to compete in science fair opportunities with the Southwest Virginia Governor's School.

patthomasrick/DTLIwANNy2

Link to Google Drive folder with project files, posters, papers, presentations, and images

Abstract

Artificial neural networks (ANNs) excel at identifying patterns in data. This project tested whether an ANN could read measurements of deciduous tree leaves and accurately identify their species. Rather than using internet images, leaf images were collected by hand by the researcher from locations in Southwest Virginia. The problem was approached in Python 3.5.2, and the multilayer feed-forward backpropagated ANN was built with the Fast Artificial Neural Network library and utilized one hidden layer. This year’s iteration of the project aimed to improve upon last year’s project in terms of accuracy. Features such as length, breadth, perimeter, area, margin variability, contours, and vein features were all automatically measured from leaf images by the program. The data measured from leaves were stored in Extensible Markup Language (XML) files. The ANN was trained on the measurements from the leaves. The program was tested on both the leaves collected this year and the leaves from last year’s project, and random sets of species of varying size were drawn to fully test the program. For the new leaves and the new program, accuracy varied from 76.00% for 2 species to 36.81% for 7 species. For the old leaves, it varied from 86.67% for 2 species to 26.18% for 11 species. The current project significantly outperformed last year’s program; however, its low accuracy for greater numbers of species still makes the program ineffective for real-life applications that need accurate identification across many species.

Introduction

Trees exist all around us; however, identifying them is often inconvenient. Without either investing a significant amount of time studying in the field or spending time looking through an identification guide, identifying trees remains to this day a task that is unnecessarily difficult given the technology available. Additional problems remain in identifying the species of trees manually. The process of distinguishing one leaf from another is prone to human error and thus is often inaccurate, and, as for the identification book method, it can be inconvenient to carry around one or more books. Nowadays, with the mass introduction of mobile computing devices, the technology to digitize the identification of trees seems very much possible. The technology should exist to make knowledge regarding trees more accessible to those with access to computers.

Artificial neural networks (ANNs) are one such technology that has shown proficiency at pattern and image classification. ANNs have received extensive testing in areas such as satellite imaging in Hepner’s research, where ANNs were shown to outperform conventional classification procedures (Hepner, 1990). ANNs were used by Li et al. to distinguish smoke from forest fires in satellite radiometer images (Li et al., 2001). In contrast to satellite imagery, ANNs have also found use in smaller-scale applications such as face detection and identification (Rowley et al., 1998).

This project aims to use ANNs to identify the species of trees. The ANN will be constructed with Nissen’s open-source Fast Artificial Neural Network (FANN) library (Nissen, 2003). FANN allows for machine learning and real-time analytical decision-making and was designed with intuitiveness and speed in mind. The flexibility and versatility of ANNs make them an excellent solution to a wide array of problems, such as image classification. One pitfall of artificial neural networks is the training required to properly prepare them for use. Training is often slow; it is prone to overtraining, in which an ANN learns the raw numbers in its training data rather than the patterns behind them; and it is sometimes hard to complete because a large set of training data is hard to produce. Overtraining and overfitting are among the more critical problems of ANNs, as described by Tetko et al. (1995). The ANN used has a hidden neuron layer two-thirds the size of the sum of the input and output layers (Sarle, 2014):

$$N_{\text{hidden}} = \frac{2}{3}\left(N_{\text{input}} + N_{\text{output}}\right)$$
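As a minimal sketch of the two-thirds rule above, the hidden layer size can be computed from the input and output layer sizes. The function name and the rounding to the nearest whole neuron are assumptions; the source does not say how a fractional result was handled, and the layer sizes in the usage line are hypothetical.

```python
def hidden_layer_size(n_inputs: int, n_outputs: int) -> int:
    """Two-thirds of (inputs + outputs), rounded to a whole neuron.

    Rounding to the nearest integer (with a floor of 1) is an assumption;
    the source only gives the two-thirds proportion.
    """
    return max(1, round(2 * (n_inputs + n_outputs) / 3))


# Hypothetical layer sizes, for illustration only.
print(hidden_layer_size(12, 7))
```

With 12 inputs and 7 outputs, two-thirds of 19 is about 12.67, so this sketch would use 13 hidden neurons.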

The ANN will be trained on the length, perimeter, width, veins, and contours of a leaf, with the goal of telling different species of leaves apart.

Python 3.5.2 was the development environment used. NumPy 1.11.1 was used for its array manipulations, for the storage and retrieval of data, and because other libraries depend on it (Van der Walt et al., 2011). In addition to NumPy, SciPy 0.18.1 was also used in this project (Jones et al., 2015). From SciPy, Hunter’s Matplotlib 2.0.0b4 and Pérez et al.’s IPython 5.0.0 were used to generate figures and to assist with development, respectively (Hunter, 2007; Pérez et al., 2007). Van der Walt et al.’s Scikit-image 0.12.3 was used to process images of tree leaves (Van der Walt et al., 2014).

This project is a continuation of, and aims to improve upon, the original project on which it is based. The first year of this project, Deciduous Tree Leaf Identification with Artificial Neural Networks, used the same libraries, albeit with Python 2.7.11 and older versions of all other libraries, and had the same goals and motivations. A major difference between this year’s project and last year’s is the source of the leaf images. In year 1’s project, images were collected from the Internet, namely ClearedLeavesDB (Das et al., 2014). In this iteration of the project, however, leaves were collected by hand from three different locations in Southwest Virginia: Mountain Lake, Pandapas Pond, and Angel’s Rest (a location on the Appalachian Trail near Pearisburg). Since leaves were manually collected, a scale for each leaf could be obtained. This project also aims to replace the width measurement system used by year 1’s program, as it did not strongly correlate with accuracy. It is replaced by a new proportion-based contour system that measures the angles between points positioned relative to the widest section of the leaf, in addition to plain width measurements.

Methods and Materials

The project was programmed in Python 3.5.2 (as opposed to last year’s Python 2.7.10). The computer used ran Ubuntu 16.04 LTS, and the development environment was contained in PyEnv. The environment included the packages SciPy 0.18.1 (which included Matplotlib 2.0.0b4, NumPy 1.11.1, IPython 5.0.0, and some other libraries not used in this project), Scikit-Image 0.12.3 (Skimage), and Python Bindings for Fast Artificial Neural Network Library 1.0.7 (operating on FANN 2.2.0).

Images of leaves were collected by hand from 3 locations in Southwest Virginia: Mountain Lake, Pandapas Pond, and Angel’s Rest. Similar to last year’s project, only simple leaves were collected as opposed to compound leaves. The collection process went as follows: petioles of leaves were first removed, as no plans were made to measure the petioles. Leaves were placed to the right of a scale with the tip of the blade of the leaf oriented to the right. The scale consisted of only a black diamond (2 cm between adjacent corners). Leaves were made as level as possible, and then an image was taken of the leaf beside the scale. Leaves were then sorted by species on a computer. The accuracy of the researcher’s own classification of the leaves is uncertain. However, for the purposes of this project, as long as similar leaves were grouped together, the program should have functioned as planned. Only species with a sufficient number of images were used, as species with few images (<10) were unwise to use with the ANN.

The program was designed to measure the leaves given to it automatically, without input from the user. To provide enough information for the ANN to identify the leaves accurately, the following features were measured: length, breadth, perimeter, area, margin variability, contours, and vein features. Breadth is the greatest width of the leaf. Margin variability was measured as the mean change in the leaf margin per image column, in centimeters.

Contours were measured in two ways: widths and angles. Each leaf was split at its widest point, and the resulting sections were split proportionally into quarters and eighths. Once those positions were found, the width of the leaf at each one was measured. Angles were measured from the same points at which the widths were taken: trigonometry gave the angle for the points above and below the midrib, and the absolute values of the two angles were averaged to give the angle for that numbered section.

Finally, several vein measurements were gathered. The Canny edge detection function was used to find the edges in the leaf image. Veins were separated from the margin by subtracting the margin of the thresholded leaf from the binary Canny edge image. The probabilistic Hough line transform was then used to gather line segments roughly representing the veins. Segments that were nearly level and near the center of the image were classified as midrib lines, and their average angle was taken as the angle of the midrib, giving a linear approximation of the midrib. The line segments above and below the midrib were treated as veins branching off the midrib and were measured similarly.
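The midrib step described above (keep the nearly level segments near the center of the image, then average their angles) could be sketched roughly as follows. This is not the project's code: the function names and both thresholds (`max_angle`, `center_band`) are hypothetical, and the segments are assumed to arrive as `((x0, y0), (x1, y1))` pairs, the form in which a probabilistic Hough transform typically reports them.

```python
import math


def segment_angle(segment):
    """Angle of a segment in degrees, folded into (-90, 90] so that
    the direction in which the segment was drawn does not matter."""
    (x0, y0), (x1, y1) = segment
    angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
    if angle > 90:
        angle -= 180
    elif angle <= -90:
        angle += 180
    return angle


def estimate_midrib_angle(segments, image_height, max_angle=15.0, center_band=0.2):
    """Average the angles of nearly level segments near the vertical center.

    max_angle (degrees) and center_band (fraction of the image height)
    are illustrative thresholds, not values taken from the project.
    Returns None when no segment qualifies as a midrib line.
    """
    center_y = image_height / 2
    tolerance = center_band * image_height / 2
    angles = []
    for segment in segments:
        (_, y0), (_, y1) = segment
        mid_y = (y0 + y1) / 2
        angle = segment_angle(segment)
        if abs(angle) <= max_angle and abs(mid_y - center_y) <= tolerance:
            angles.append(angle)
    return sum(angles) / len(angles) if angles else None


# Made-up segments in a 100-pixel-tall image, for illustration only.
segments = [
    ((0, 50), (100, 50)),   # level and centered: midrib candidate
    ((10, 48), (90, 52)),   # slightly tilted and centered: midrib candidate
    ((20, 10), (40, 40)),   # steep: treated as a branching vein
    ((0, 90), (100, 90)),   # level but far from the center: ignored
]
print(estimate_midrib_angle(segments, image_height=100))
```

Segments rejected by the filter are the ones a fuller implementation would go on to measure as veins above or below the midrib.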

After measuring, the data generated by the program were stored in multiple files. Data that were not array-like were stored in one unified Extensible Markup Language (XML) file. Data held in arrays, lists, or dictionaries were stored in the NPZ file format, a zipped archive of NPY files (NPY being a standard binary format designed for saving NumPy arrays). Every image had its own corresponding NPZ file; however, the scalar numerical data of all images (length, perimeter, etc.) were saved to the same XML file.
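A rough sketch of this split storage, assuming hypothetical element names, attribute names, and measurement values (the project's actual XML schema and array keys are not given here): scalar measurements go into one shared XML tree, while array data for each image go into their own NPZ archive. An in-memory buffer stands in for the per-image file.

```python
import io
import xml.etree.ElementTree as ET

import numpy as np


def add_leaf(root, name, species, scalars):
    """Append one leaf's scalar measurements to the shared XML tree.

    The element and attribute names are hypothetical.
    """
    leaf = ET.SubElement(root, "leaf", name=name, species=species)
    for key, value in scalars.items():
        ET.SubElement(leaf, key).text = repr(float(value))
    return leaf


# One XML document holds the scalar data for every image.
root = ET.Element("leaves")
add_leaf(root, "leaf_001", "hypothetical_species", {"length": 8.4, "perimeter": 21.7})
xml_text = ET.tostring(root, encoding="unicode")

# Each image gets its own NPZ archive for array data.
# The values below are made up for illustration.
buffer = io.BytesIO()  # stands in for a file such as "leaf_001.npz"
np.savez(
    buffer,
    contour_widths=np.array([1.2, 2.9, 3.4]),
    contour_angles=np.array([31.0, 12.5, 4.0]),
)

# Reading the arrays back by name.
buffer.seek(0)
with np.load(buffer) as arrays:
    widths = arrays["contour_widths"]

print(xml_text)
print(widths)
```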

The ANN built by FANN was a multilayer feed-forward back-propagated neural network. The ANN had one hidden neuron layer. The input layer consisted of a leaf’s midrib length, perimeter, area, width, contour widths, contour angles, vein length, vein angle above and below the midrib, and margin variability. The output layer consisted of a value between -1 and 1 for every species the ANN was trained to identify. The greatest single output was treated as the ANN’s final guess, which was then compared to the leaf’s actual species. Whether the ANN was correct or not was recorded, and this process was repeated many times for all species used in the collection of leaves. The results of runs of the ANN were used to manually adjust the ANN afterward.
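The decision rule above (take the greatest output as the guess, then compare it to the actual species) can be illustrated with a short sketch. The function names, species labels, and output values are hypothetical; the network itself is left out, so the outputs are supplied by hand.

```python
def predict_species(outputs, species_names):
    """Return the species whose output neuron has the greatest value."""
    best_index = max(range(len(outputs)), key=lambda i: outputs[i])
    return species_names[best_index]


def accuracy(predictions, actual):
    """Fraction of predictions that match the actual species."""
    correct = sum(1 for p, a in zip(predictions, actual) if p == a)
    return correct / len(actual)


# Hypothetical species labels and made-up outputs in the range [-1, 1].
species_names = ["species_a", "species_b", "species_c"]
guess = predict_species([-0.8, 0.35, 0.1], species_names)
print(guess)
```

Here the second output is the greatest, so the sketch reports `species_b` as the guess.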

To compare the new program to the old program, the new program was run with the old leaf images and the old program was run with the new images. Because of the random nature of training ANNs, the ANN was run multiple times. Once data were collected, a bar graph with error bars (representing standard error) was used to look for significant results. Since this project was a continuation and large changes were introduced to the program, the test also compared the results to those of the older project.
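For reference, the standard error used for the error bars is the sample standard deviation of the per-run accuracies divided by the square root of the number of runs. A minimal sketch, with a hypothetical function name and made-up accuracies rather than project data:

```python
import math
import statistics


def mean_and_standard_error(values):
    """Mean of repeated runs and the standard error of that mean."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Made-up per-run accuracies, for illustration only.
runs = [0.70, 0.80, 0.75, 0.85, 0.70]
mean, standard_error = mean_and_standard_error(runs)
print(f"{mean:.4f} +/- {standard_error:.4f}")
```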

No real threats to anyone’s safety, including the researcher’s, were identified. Likewise, no special safety procedures were required.

Results

The leaf images from both the first and second years of this project were measured and run through the programs to compare their effectiveness. Measurements of the leaves can be found in appendix tables A1 and A2. Since the original leaves in the images were never measured by hand, the measurements returned from the program were taken as-is.

First, the newly gathered leaves were tested in the current iteration of the program. After filtering by the quality of the measurements, 152 images remained. To eliminate species that were underrepresented in the set, species with fewer than 10 images were omitted from testing, leaving seven species for the new images.

The accuracy of the ANN behaved similarly to last year’s project, falling as the number of species in a set increased. The accuracy for each set size was computed as the average of the ANN’s correctness over 100 runs, with each run having its own random set of species. When discerning between two species, the current program had a 76.00% accuracy. The accuracy was lowest at the greatest number of species in a set, seven, where the ANN had an accuracy of 36.81%. The trendline, a power function, can be represented with the equation $y = 0.8298x^{-0.425}$, where $x$ is the number of species in a set and $y$ is the accuracy as a fraction. The corresponding coefficient of determination is $R^2 = 0.931$. This is all shown in Chart B2.

The results of the current program running on the internet images of leaves from the previous year’s project were slightly better. For any two species in a set, the average accuracy over 100 runs was 86.67%. As expected, the accuracy decreased as the number of species in a set increased. For comparison with the new leaves, at 6 species the accuracy was 43.56%. For 11 species, the maximum possible for the old leaf images, the accuracy was 26.18%. The loss in accuracy is best represented with a power trendline, whose equation is $y = 0.9956x^{-0.55}$ with a coefficient of determination of $R^2 = 0.9663$. Refer to Chart B1 for more in-depth information.
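To make the two power trendlines concrete, the sketch below evaluates them at the smallest and largest set sizes tested. The coefficients and exponents are the ones reported above; the function and variable names are hypothetical, and the exponents are taken to be negative because accuracy falls as the number of species grows.

```python
def power_trend(n_species, coefficient, exponent):
    """Predicted accuracy (as a fraction) from a power trendline y = a * x**b."""
    return coefficient * n_species ** exponent


# Trendline parameters as reported in the Results section.
NEW_LEAVES = (0.8298, -0.425)   # year two images, up to 7 species
OLD_LEAVES = (0.9956, -0.55)    # year one images, up to 11 species

new_at_7 = power_trend(7, *NEW_LEAVES)
old_at_11 = power_trend(11, *OLD_LEAVES)
print(f"new leaves, 7 species: {new_at_7:.3f}")
print(f"old leaves, 11 species: {old_at_11:.3f}")
```

The fitted values land close to the measured accuracies at those set sizes (36.81% and 26.18%), which is what the high $R^2$ values suggest. A trendline is only a summary of the tested range, so extrapolating it to larger numbers of species would be speculative.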

For two and three species in a set, the old leaves were significantly easier for the ANN to discern than the new images. However, for the rest of the comparable results, the rates were not significantly different.

Discussion

In comparison to year one’s project, the program on average performed slightly better. With year one’s program, the average accuracy for any three species was 61.57% (standard error of 1.37%), and for any eleven species it was 23.57% (standard error of 0.40%). In comparison, the year two program running on the images from year one was 77.00% accurate for three species and 26.18% accurate for eleven species (standard errors of 1.71% and 1.16%, respectively). Thus, the year two program is significantly more accurate than year one’s program for these two comparisons.
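One informal way to read such a comparison is to check whether the intervals of mean plus or minus one standard error overlap. The sketch below applies that check to the figures quoted above, reading the first pair as year one's program and the second as year two's. The function name is hypothetical, and non-overlapping error bars are only a rough indicator, not a formal significance test.

```python
def error_bars_overlap(mean_a, se_a, mean_b, se_b):
    """True when the intervals mean +/- standard error overlap."""
    return (mean_a - se_a) <= (mean_b + se_b) and (mean_b - se_b) <= (mean_a + se_a)


# Accuracies and standard errors in percent, as quoted in the Discussion.
three_species = error_bars_overlap(61.57, 1.37, 77.00, 1.71)
eleven_species = error_bars_overlap(23.57, 0.40, 26.18, 1.16)
print(three_species, eleven_species)
```

Neither pair of intervals overlaps, which is consistent with the conclusion drawn above.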

That the leaves from year one were easier to identify than the leaves from year two was not expected. It is all the more puzzling because the new set of leaves has one more reliable factor that aids identification: a correct scale for metric length. More robust methods of measuring the leaves are needed, perhaps ones that measure leaves dynamically regardless of a leaf’s natural curve and that can correctly represent all leaves. There are perhaps bugs in the way that length, perimeter, and area are measured in relation to the scale, so that the constant scale actually helped the year one leaves and its absence hurt the year two leaves.

Another difference is that while the year two collection contained a wide variety of species, few leaves were collected per species, making many of the species unusable. Some of the species used had only about 10 images, leaving only about seven to train the ANN and three to test its training.

The ANN parameters still need further adjustment as well. Utilizing two hidden neuron layers early in the testing of the ANN produced very poor results, with the ANN matching the accuracy expected if it were only guessing randomly. While comp.ai.neural-nets’s suggested hidden neuron formula works for one layer, it may not necessarily translate to multiple hidden layers (Sarle, 2014).

As for real-life applications of this technology, uses are again limited. For small numbers of species, or where accuracy is not of the highest concern, this program could be used to batch process many leaves (or leaf-like objects) at once, with more already-identified leaves leading to better sorting. As shown by the research, however, large numbers of species would not be well suited to this program in its current state.
