# deep-learning-challenge

Alphabet Soup, a nonprofit foundation, aims to optimize its funding decisions by predicting the success of applicants using a deep learning model. This project uses machine learning and neural networks to create a binary classifier that determines whether an organization, if funded, will be successful, based on historical data.


## 1. Importing Required Libraries

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import tensorflow as tf
import ssl
import certifi
```

What this does:

- `train_test_split`: splits data into training and testing sets.
- `StandardScaler`: normalizes numerical features for better model performance.
- `pandas` (`pd`): handles data manipulation and CSV reading.
- `tensorflow` (`tf`): builds and trains the deep learning model.
- `ssl` & `certifi`: ensure secure HTTPS requests when loading the dataset.

## 2. Fixing SSL Errors for Secure URL Access

```python
# Have Python build its default HTTPS context from certifi's up-to-date CA bundle.
ssl._create_default_https_context = lambda: ssl.create_default_context(cafile=certifi.where())
```

Some systems have SSL verification issues when accessing online data. This line makes Python use the latest SSL certificates so the dataset URL can be read over a secure connection.

## 3. Loading the Dataset

```python
application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")
application_df.head()
```

- Reads the charity dataset from the provided cloud URL.
- `.head()` displays the first 5 rows to check that the dataset loaded correctly.

## 4. Dropping Non-Beneficial Columns

```python
application_df = application_df.drop(columns=["EIN", "NAME"])
```

- `EIN` and `NAME` are identifiers and do not contribute to predicting success, so we remove them.

## 5. Checking Unique Values for Data Cleaning

```python
application_df.nunique()
```

- Determines how many unique values exist in each column.
- Helps decide which categorical values should be grouped into "Other".

## 6. Grouping Rare Categories as "Other"

### Application Type

```python
application_type_counts = application_df["APPLICATION_TYPE"].value_counts()
application_types_to_replace = list(application_type_counts[application_type_counts < 500].index)
```

```python
for app in application_types_to_replace:
    application_df["APPLICATION_TYPE"] = application_df["APPLICATION_TYPE"].replace(app, "Other")
```

- Counts occurrences of each `APPLICATION_TYPE`.
- Groups categories with fewer than 500 occurrences into `"Other"` to simplify the model.

### Classification Type

```python
classification_counts = application_df["CLASSIFICATION"].value_counts()
classifications_to_replace = list(classification_counts[classification_counts < 1000].index)
```

```python
for cls in classifications_to_replace:
    application_df["CLASSIFICATION"] = application_df["CLASSIFICATION"].replace(cls, "Other")
```

- Same process as above, but for `CLASSIFICATION` values.
- Groups low-frequency classifications into `"Other"` for better generalization.
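To confirm the binning worked, the value counts can be re-checked. This quick verification is a common follow-up step, though it is not shown in the original cells above:

```python
# After the replacements, every low-frequency category should now be counted under "Other".
print(application_df["APPLICATION_TYPE"].value_counts())
print(application_df["CLASSIFICATION"].value_counts())
```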

## 7. Converting Categorical Data to Numeric

```python
application_df = pd.get_dummies(application_df)
```

- Converts categorical variables (e.g., `APPLICATION_TYPE`) into one-hot encoded numeric columns.
- This is necessary because machine learning models work with numbers, not text.
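As a toy illustration of what one-hot encoding produces (the tiny DataFrame below is made up for demonstration and is not part of the charity dataset):

```python
import pandas as pd

# A miniature categorical column, encoded the same way get_dummies encodes the real data.
demo = pd.DataFrame({"APPLICATION_TYPE": ["T3", "T4", "Other"]})
print(pd.get_dummies(demo, dtype=int))
#    APPLICATION_TYPE_Other  APPLICATION_TYPE_T3  APPLICATION_TYPE_T4
# 0                       0                    1                    0
# 1                       0                    0                    1
# 2                       1                    0                    0
```

Each category becomes its own 0/1 column, which is why the feature count grows after this step.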

## 8. Splitting the Data into Features (X) and Target (y)

```python
X = application_df.drop(columns=["IS_SUCCESSFUL"]).values
y = application_df["IS_SUCCESSFUL"].values
```

- `X` contains all the input features.
- `y` contains the target variable (`IS_SUCCESSFUL`), which we want to predict.

## 9. Splitting Data into Training and Testing Sets

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

- 80% training data, 20% testing data (`test_size=0.2`).
- `random_state=42` ensures reproducibility (same results each time).

## 10. Scaling the Features for Better Training

```python
scaler = StandardScaler()
X_scaler = scaler.fit(X_train)
```

```python
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
```

- `StandardScaler` standardizes numerical values (mean = 0, standard deviation = 1) for better model performance.
- The scaler is fit on the training data only and then applied to both sets, so no information from the test set leaks into training.
- Scaling matters for deep learning models because it prevents large differences in feature magnitude from dominating training.

## 11. Building the Neural Network Model

```python
nn = tf.keras.models.Sequential()
```

- A `Sequential` model defines a linear stack of layers for the deep learning model.

### First Hidden Layer

```python
nn.add(tf.keras.layers.Dense(units=80, activation="relu", input_dim=X_train.shape[1]))
```

- 80 neurons determine how much information the layer can process.
- ReLU activation zeroes out negative values and speeds up training.
- `input_dim=X_train.shape[1]` sets the input layer size to match the number of features.

### Second Hidden Layer

```python
nn.add(tf.keras.layers.Dense(units=30, activation="relu"))
```

- 30 neurons: a smaller second layer that helps the network learn more complex patterns.
- ReLU activation keeps only positive values.

### Output Layer

```python
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))
```

- 1 neuron outputs a probability (0 to 1) for binary classification.
- Sigmoid activation converts the output into a probability of success vs. failure.
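With all three layers added, it can help to print the architecture before compiling. This optional sanity check is not part of the original steps above:

```python
# Show each Dense layer's output shape and trainable parameter count.
nn.summary()
```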

## Libraries and Tools Used

This project relies on several key libraries and tools for machine learning, deep learning, and data preprocessing. Below is a breakdown of each one, why it is used, and how it contributes to the project.

### 1. Pandas (`pandas`)

What it does:

- Pandas is a powerful library for working with structured data (tables, CSV files, etc.).
- It allows data to be cleaned, manipulated, and preprocessed before being fed into the model.

How it's used in this code:

```python
import pandas as pd

application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")
```

- Reads a CSV file from a URL and loads it into a DataFrame (`application_df`).
- DataFrames provide an easy-to-use structure for inspecting and transforming data.

### 2. Scikit-Learn (`sklearn`)

What it does:

- Scikit-Learn is a popular machine learning library used here for data preprocessing and model evaluation.

#### a) Train-Test Split

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

- Splits data into training (80%) and testing (20%) sets.
- Ensures the model is trained on one portion of the data and tested on unseen data.

#### b) Feature Scaling (`StandardScaler`)

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaler = scaler.fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
```

- Standardizes numerical data (mean = 0, standard deviation = 1).
- Ensures that no single feature dominates because of larger raw values (e.g., income vs. age).

### 3. TensorFlow (`tensorflow`)

What it does:

- TensorFlow is an open-source deep learning framework.
- It provides the tools for building and training neural networks.

#### Building the Model

```python
import tensorflow as tf

nn = tf.keras.models.Sequential()
```

- Initializes a `Sequential` neural network, meaning layers are stacked in order.

#### Adding Layers

```python
nn.add(tf.keras.layers.Dense(units=80, activation="relu", input_dim=X_train.shape[1]))
nn.add(tf.keras.layers.Dense(units=30, activation="relu"))
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))
```

- Dense layers: fully connected layers where each neuron connects to all neurons in the next layer.
- ReLU activation: helps the network learn complex relationships.
- Sigmoid activation (output layer): converts the output into a probability (0 to 1) for binary classification.

#### Compiling the Model

```python
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```

- Loss function (`binary_crossentropy`): best suited to binary classification problems.
- Optimizer (`adam`): adjusts the learning rate dynamically for efficient training.
- Metric (`accuracy`): measures how well the model classifies the data.

#### Saving the Model

```python
nn.save("/Users/christinaland/Challenge 21/AlphabetSoupCharity.h5")
```

- Saves the trained model in HDF5 format (`.h5`).
- Allows the model to be reloaded and used later without retraining.
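Between compiling and saving, the model still has to be trained and evaluated. A minimal sketch of that step is below; the epoch count is an illustrative assumption, not a value taken from this repository:

```python
# Train the compiled model on the scaled training data.
fit_model = nn.fit(X_train_scaled, y_train, epochs=100, verbose=2)

# Measure loss and accuracy on the held-out, scaled test data.
model_loss, model_accuracy = nn.evaluate(X_test_scaled, y_test, verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")
```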

### 4. SSL & Certifi (`ssl` & `certifi`)

Why we use it:

- Some systems block HTTPS requests due to outdated security settings.
- `ssl` and `certifi` ensure a secure internet connection when loading the dataset.

How it's used:

```python
import ssl
import certifi

# Have Python build its default HTTPS context from certifi's up-to-date CA bundle.
ssl._create_default_https_context = lambda: ssl.create_default_context(cafile=certifi.where())
```

- Points Python at up-to-date SSL certificates to prevent data-loading errors.

### 5. NumPy (`numpy`) (implicitly used by Pandas & TensorFlow)

What it does:

- Provides fast mathematical operations on arrays.
- Is used in the background by Pandas (for DataFrames) and TensorFlow (for matrix operations).
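A quick illustration of where NumPy shows up, assuming the preprocessing cells above have already been run:

```python
import numpy as np

# .values and StandardScaler both hand back NumPy arrays, which is what TensorFlow trains on.
print(type(X_train_scaled))                       # <class 'numpy.ndarray'>
print(np.round(X_train_scaled.mean(axis=0), 2))   # roughly 0 for each scaled feature
print(np.round(X_train_scaled.std(axis=0), 2))    # roughly 1 for each scaled feature
```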

### 6. HDF5 File Format (`.h5` Files)

What it is:

- HDF5 (`.h5`) is a binary file format for storing large numerical data, such as neural network weights.
- It preserves the model's weights and structure so the model can be reloaded without retraining.
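A short sketch of reloading the saved model, assuming the same `.h5` path used in the save step above:

```python
import tensorflow as tf

# Reload architecture and weights from the HDF5 file; no retraining is needed.
loaded_model = tf.keras.models.load_model("/Users/christinaland/Challenge 21/AlphabetSoupCharity.h5")
loaded_model.summary()
```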
