Many attempts have been made to classify objects within our solar system into groups in order to estimate their properties. However, classical approaches can miss the subtle correlations that machine learning techniques thrive on. This study aims to improve the prediction of asteroid features using machine learning algorithms. We use a dataset provided by the Jet Propulsion Laboratory of the California Institute of Technology and apply various regression techniques to achieve higher accuracy and lower error rates in feature prediction. The dataset comprises 31 features for 839,714 objects, including their names, semi-major axis, eccentricity, inclination, orbital period, diameter, and other orbital elements. Our project focuses on feature engineering and on linear and polynomial regression models. Additionally, we use clustering algorithms to attempt to classify asteroids. Our findings contribute to the growing intersection between machine learning and astronomy, providing tools for potential applications in space warning systems.
This project analyzes a dataset with 839,714 observations and 31 features. The analysis includes data cleaning, encoding, and visualization to understand correlations and distributions.
The features in our data are described below.
Feature Name | Description |
---|---|
full_name | Full Name of Body: Contains full unique name of the body. |
a | Semi-Major Axis (Unit - au): The average distance between the object and the Sun, measured in astronomical units (au). |
e | Eccentricity: Describes the shape of the object's orbit, with values ranging from 0 (circular) to close to 1 (highly elliptical). |
G | Magnitude Slope Parameter: Factor in determining the brightness variation of the object, reflecting how its brightness changes with phase angle. |
i | Inclination (Unit - deg): Angle of the object's orbital plane relative to the plane of the solar system, measured in degrees. |
om | Longitude of the Ascending Node: Angle from the reference direction (usually the vernal equinox) to the point where the object's orbit crosses the plane of the solar system from South to North. |
w | Argument of Perihelion: Angle between the ascending node and the point of closest approach to the Sun (perihelion). |
q | Perihelion Distance (Unit - au): Shortest distance between the object and the Sun during its orbit, measured in astronomical units (au). |
ad | Aphelion Distance (Unit - au): Farthest distance between the object and the Sun during its orbit, measured in astronomical units (au). |
per_y | Orbital Period: Time taken for the object to complete one full orbit around the Sun, measured in years. |
data_arc | Data Arc-Span (Unit - Days): Duration over which observations of the object have been collected, measured in days. |
condition_code | Orbit Condition Code: Numerical code indicating the quality and reliability of the object's orbital data, with 0 being the most reliable. |
n_obs_used | Number of Observations Used: Total number of observations used to determine the object's orbital parameters. |
H | Absolute Magnitude Parameter: Measure of the object's intrinsic brightness, indicating its size and reflectivity. |
diameter | Diameter of Asteroid (Unit - Km): Physical size of the asteroid, measured in kilometers (km). |
extent | Object Bi/Tri-Axial Ellipsoid Dimensions (Unit - Km): Dimensions describing the shape and size of the object in terms of its three principal axes, measured in kilometers (km). |
albedo | Geometric Albedo: Reflectivity of the object's surface, indicating the proportion of sunlight it reflects. |
rot_per | Rotation Period (Unit - Hours): Time taken for the object to complete one full rotation on its axis, measured in hours. |
GM | Standard Gravitational Parameter: Product of the gravitational constant and the object's mass, used in gravitational calculations. |
BV | Color Index B-V Magnitude Difference: Difference in brightness between the object in the B (blue) and V (visual) photometric bands, indicating its color. |
UB | Color Index U-B Magnitude Difference: Difference in brightness between the object in the U (ultraviolet) and B (blue) photometric bands, providing spectral information. |
IR | Color Index I-R Magnitude Difference: Difference in brightness between the object in the I (infrared) and R (red) photometric bands, conveying thermal properties. |
spec_B | Spectral Taxonomic Type (Unit - SMASSII): Spectral classification of the object based on the SMASSII scheme, indicating its mineral composition and surface features. |
spec_T | Spectral Taxonomic Type (Unit - Tholen): Spectral classification of the object based on the Tholen system, indicating its spectral characteristics, composition, and origin. |
neo | Near Earth Object: Indicates whether the object is classified as a Near Earth Object (NEO), meaning its orbit brings it close to Earth's orbit. |
pha | Potentially Hazardous Asteroid: Identifies whether the object is classified as a Potentially Hazardous Asteroid (PHA), posing a potential threat to Earth. |
moid | Earth Minimum Orbit Intersection Distance (Unit - au): Smallest distance between the object's orbit and Earth's orbit, measured in astronomical units (au), indicating potential close encounters. |
class | Class of Asteroid: Visit nasa.com to learn more about classes |
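n | Mean Motion (Unit - deg/day): Average angular rate at which the object moves along its orbit, measured in degrees per day. |
per | Orbital Period (Unit - Days): Time taken for the object to complete one full orbit around the Sun, measured in days. |
ma | Mean Anomaly (Unit - deg): Angle describing the object's position along its orbit at the epoch, measured in degrees. |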
We utilized a heatmap to visualize the correlations between different features in the dataset. This graphical representation helps in identifying the strength and direction of relationships among the variables, providing a clear and intuitive way to detect patterns, trends, and anomalies in the data. The heatmap is particularly useful for understanding multicollinearity, guiding feature selection, and improving model performance.
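As a minimal sketch of how the correlation matrix behind such a heatmap can be computed, the snippet below uses pandas on a small synthetic frame. The values are made up for illustration and are not JPL data, and the helper name `correlations_with` is hypothetical. The column names mirror the feature table above.

```python
import numpy as np
import pandas as pd

def correlations_with(df: pd.DataFrame, target: str) -> pd.Series:
    """Return Pearson correlations of every numeric column with `target`,
    ordered from strongest to weakest (the target itself is excluded)."""
    corr = df.select_dtypes("number").corr()   # full correlation matrix
    col = corr[target].drop(target)
    order = col.abs().sort_values(ascending=False).index
    return col.reindex(order)

# Synthetic stand-in data (NOT the JPL dataset), only to make the sketch runnable.
rng = np.random.default_rng(0)
H = rng.uniform(10, 20, size=500)
demo = pd.DataFrame({
    "H": H,
    "diameter": 50 - 2 * H + rng.normal(0, 1, size=500),  # strongly tied to H
    "e": rng.uniform(0, 0.5, size=500),                   # unrelated noise
})

ranked = correlations_with(demo, "diameter")
print(ranked)
```

The matrix returned by `DataFrame.corr` is the kind of input a heatmap function such as `seaborn.heatmap` expects; ranking one column of it is a quick way to shortlist candidate predictors.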
We employed the .describe() method to obtain a statistical summary of the features we are interested in. This summary includes metrics such as count, mean, standard deviation, minimum, and maximum values, as well as the 25th, 50th, and 75th percentiles. These statistics provide valuable insights into the central tendency, dispersion, and overall shape of the data distribution, facilitating the identification of outliers and informing subsequent data preprocessing and analysis steps.
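The call can be sketched as follows on a toy frame; the numbers are placeholders rather than real measurements, and only the two column names are taken from the feature table.

```python
import pandas as pd

# Toy values (not real measurements) for two of the features in the table.
toy = pd.DataFrame({
    "diameter": [1.0, 2.0, 3.0, 4.0, None],   # one missing diameter
    "H": [18.0, 17.0, 16.0, 15.0, 14.0],
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = toy[["diameter", "H"]].describe()
print(summary)
```

Because `count` excludes missing values, comparing it across columns also gives a quick first look at how much of each feature is NaN.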
To prepare the dataset for analysis, we undertook several preprocessing steps:
- Remove String Columns:
  - We dropped the columns `name`, `spec_B`, `spec_T`, and `class` as they contain string values that are not suitable for numerical analysis.
- Handle Missing Values:
  - We dropped the columns `rot_per`, `GM`, `BV`, and `UB` due to a high number of NaN values.
  - We removed any rows that contained NaN values for the diameter feature in order to ensure a clean dataset for analysis.
    Original dataset size: 839714
    Dataset size after dropping rows: 24404
    Number of rows dropped: 815310
- Check Correlations:
  - By plotting pairplots and the heatmap, we discovered a reasonably strong correlation between `diameter` and the following features: `q`, `moid`, `H`, `data_arc`, and `n`.
  - These correlations will be explored further in subsequent steps to understand their impact and relationships.
- Data Encoding:
  - We changed the values in the `pha` column from the strings 'Y' and 'N' to 1s and 0s to make graphing and working with them easier.
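- Analyzing effects of preprocessing on data distribution:
  - Using scipy.stats.ks_2samp (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html), we compared each feature's distribution before and after dropping the NaN rows. The KS test evaluates the null hypothesis that two samples were drawn from the same distribution. For the variables we are interested in, it returned p-values of 2.488278122363494e-60 for q, 0.0 for H, and 1.4086431738613219e-53 for moid. P-values this small reject that hypothesis, so the rows with a known diameter are statistically distinguishable from the full dataset rather than negligibly different. Because samples of this size make the test sensitive to even small shifts, the magnitude of the change is better judged from the KS statistic and the histograms below.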
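The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the helper name `preprocess` is hypothetical, the data are synthetic, and the way diameters are made missing here is invented purely so that the KS test has something to detect.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above and return a new frame."""
    string_cols = ["name", "spec_B", "spec_T", "class"]
    sparse_cols = ["rot_per", "GM", "BV", "UB"]
    # errors="ignore" skips any listed column that is absent from the frame.
    out = df.drop(columns=string_cols + sparse_cols, errors="ignore")
    out = out.dropna(subset=["diameter"])          # keep rows with a known diameter
    return out.assign(pha=out["pha"].map({"Y": 1, "N": 0}))  # encode hazard flag

# Synthetic stand-in data (NOT the JPL dataset).
rng = np.random.default_rng(1)
n_rows = 2000
raw = pd.DataFrame({
    "name": [f"obj{i}" for i in range(n_rows)],
    "q": rng.normal(2.5, 0.5, n_rows),
    "diameter": rng.lognormal(1.0, 0.5, n_rows),
    "pha": rng.choice(["Y", "N"], n_rows),
})
# Assumed missingness pattern: no diameter is known for objects with large q.
raw.loc[raw["q"] > 2.5, "diameter"] = np.nan

clean = preprocess(raw)

# Compare the distribution of q before and after dropping rows.
result = ks_2samp(raw["q"], clean["q"])
print(len(raw), len(clean), result.statistic, result.pvalue)
```

In `ks_2samp`, the statistic is the largest gap between the two empirical cumulative distributions, and a small p-value is evidence against the two samples sharing one distribution.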
To better understand the relationships between various features and the diameter, we graphed several feature correlations. This graphical analysis aids in identifying potential relationships and patterns that might not be immediately evident through raw data or simple statistical summaries.
- Diameter vs. q:
  - We plotted the relationship between diameter and q (perihelion distance). This scatter plot helps us observe any direct or inverse relationships between the size of the object and its perihelion distance.
- Diameter vs. moid:
  - The scatter plot between diameter and moid (minimum orbit intersection distance) was analyzed to see if there is any correlation between the object's size and its closest approach to Earth.
- Diameter vs. H:
  - We also examined the correlation between diameter and H (absolute magnitude). This plot is particularly interesting as it helps in understanding how the brightness of an object might relate to its size.
- Diameter vs. n:
  - The scatter plot of diameter versus n shows whether an object's size is related to how quickly it moves along its orbit. In the JPL data, n is the mean motion; the number of observations is stored separately in n_obs_used.
- Correlation Difference after dropping NaN values in preprocessing
- Distribution Difference after dropping NaN values in preprocessing
  - Histogram of q:
  - Histogram of H:
  - Histogram of moid:
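For the diameter versus H comparison in the list above, it may help to recall the standard relation between absolute magnitude, geometric albedo, and diameter, D = 1329 km / sqrt(albedo) × 10^(−H/5). The sketch below is only an illustration of that relation, not the model used in this project; the function name `diameter_km` and the albedo value are assumptions.

```python
import math

def diameter_km(H: float, albedo: float) -> float:
    """Estimate diameter in km from absolute magnitude H and geometric albedo
    using D = 1329 / sqrt(albedo) * 10 ** (-H / 5)."""
    if albedo <= 0:
        raise ValueError("albedo must be positive")
    return 1329.0 / math.sqrt(albedo) * 10 ** (-H / 5)

# Illustrative inputs only; the albedo of 0.15 is an assumed placeholder.
for H in (10.0, 15.0, 20.0):
    print(H, round(diameter_km(H, albedo=0.15), 3))
```

Under this relation, log10 of the diameter is linear in H at fixed albedo, so a scatter plot of diameter against H tends to look much straighter with a logarithmic diameter axis.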
These visualizations provide several insights:
- Identifying Outliers:
  - Scatter plots help in easily identifying any outliers that may exist in the data, which could potentially skew the analysis or indicate errors or special cases.
- Understanding Distribution:
  - The spread and clustering of points in these graphs can provide an understanding of how uniformly or variably the features are distributed.
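A visual check for outliers can be complemented with a simple numeric rule. The sketch below uses the common 1.5 × IQR fence, which is a swapped-in technique for illustration and not something this project is stated to use; the helper name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies more than k * IQR below the
    first quartile or above the third quartile."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up values with one obviously extreme entry.
sample = pd.Series([1.0, 1.2, 0.9, 1.1, 1.0, 25.0])
mask = iqr_outliers(sample)
print(sample[mask])
```

For heavily skewed features such as diameter, applying the rule to log-transformed values may flag fewer legitimate large objects.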