fix: changed the document example to include code/bold formatting #90

Merged · 1 commit · Feb 1, 2024
44 changes: 15 additions & 29 deletions docs/example.ipynb
@@ -7,9 +7,9 @@
"# `datexplore` Example Usage\n",
"Here we will show how the datexplore package can be used for the early stages of a data analysis project. We will show example usages for each function in the package (`clean_names`, `visualise`, and `detect_outliers`). \n",
"\n",
"The early stages of data analysis projects often begin with similar steps. For many projects, data cleaning and exploratory data analysis are essential before beginning more complex analysis. Using clean data for your analysis can make your code less suceptible to bugs or errors. Additionally, performing exploratory data analysis can help to direct the analysis of your project and gives a stronger understanding of the data you are working with. \n",
"The early stages of data analysis projects often begin with similar steps. For many projects, data cleaning and **exploratory data analysis (EDA)** are essential before beginning more complex analysis. Using clean data for your analysis can make your code less suceptible to bugs or errors. Additionally, performing EDA can help to direct the analysis of your project and gives a stronger understanding of the data you are working with. \n",
"\n",
"This package aims to help with the early stages of a project. Specifically, it contains a function to clean the column names of tabular data, a function to detect outliers in numerical data, and a function to create useful visulaization for exploratory data analysis. "
"This package aims to help with the early stages of a project. Specifically, it contains a function to clean the column names of tabular data, a function to detect outliers in numerical data, and a function to create useful visulaization for EDA. "
]
},
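The import cell itself is not visible in this diff. A minimal setup sketch, assuming the three functions can be imported from the top-level `datexplore` package (the exact import path is an assumption, not shown in the diff):

```python
# Setup sketch -- the import path is assumed, not confirmed by this diff.
import pandas as pd

from datexplore import clean_names, detect_outliers, visualise
```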
{
@@ -38,10 +38,10 @@
"source": [
"## Clean names\n",
"\n",
"Often times raw data contains non syntactic column names. It can be particulary troublesome when the column names contain spaces and you are working with other packages which are designed only for column names without spaces.\n",
"Often times raw data contains **non syntactic column names**. It can be particulary troublesome when the column names contain spaces and you are working with other packages which are designed only for column names without spaces.\n",
"\n",
"#### For column name with a space:\n",
"An example of one such tool which does not work for column names with spaces the .query() method from the pandas library. This is shown below:"
"An example of one such tool which does not work for column names with spaces the `.query()` method from the `pandas` library. This is shown below:"
]
},
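The code cell that demonstrates the failure is collapsed in this view. A small sketch of what such a cell might look like, with illustrative column names and data (not taken from the notebook):

```python
import pandas as pd

# Toy dataframe with a non-syntactic column name containing a space.
df = pd.DataFrame({"House Price": [300_000, 450_000, 510_000],
                   "Bedrooms": [2, 3, 4]})

try:
    # .query() parses the expression string, so the space in "House Price"
    # splits the name into two tokens and the call raises an error.
    df.query("House Price > 400000")
except Exception as err:
    print(f".query() failed: {err}")
```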
{
@@ -79,12 +79,12 @@
"metadata": {},
"source": [
"As you can see, using the column name containing a space results in an error. \n",
"Now, we can use the clean_names function to \"clean\" the column names of the data frame. By \"cleaning\" the column names, we mean that we make all column names in a dataframe such that the names only use letters, numbers, and underscores.\n",
"Now, we can use the `clean_names` function to \"clean\" the column names of the data frame. By \"cleaning\" the column names, we mean that we make all column names in a dataframe such that the names only use letters, numbers, and underscores.\n",
"\n",
"The clean_names function takes a pandas dataframe containing data with column names as an input. There is also an optional parameter, case, which specifies the capitalization structure of the output dataframe (more on this later). \n",
"The `clean_names` function takes a `pandas` dataframe containing data with column names as an input. There is also an optional parameter, case, which specifies the capitalization structure of the output dataframe (more on this later). \n",
"\n",
"#### For column names without spaces:\n",
"Below we use the clean_names function and show that the resulting dataframe can now be used with the .query() method."
"Below we use the `clean_names` function and show that the resulting dataframe can now be used with the `.query()` method."
]
},
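That cell is also collapsed here. A sketch of the intended usage, assuming `clean_names` takes the dataframe and returns a copy with cleaned column names (default `snake_case`, as described above); the import path and expected output names are assumptions:

```python
import pandas as pd
from datexplore import clean_names  # assumed import path

df = pd.DataFrame({"House Price": [300_000, 450_000, 510_000],
                   "Bedrooms": [2, 3, 4]})

# Assumes clean_names(df) returns a dataframe whose column names use only
# letters, numbers, and underscores (snake_case by default).
clean_df = clean_names(df)
print(clean_df.columns.tolist())   # expected, per the rules above: ['house_price', 'bedrooms']

# The cleaned name is a valid identifier, so .query() now works.
print(clean_df.query("house_price > 400000"))
```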
{
@@ -234,15 +234,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This may not seem that useful for a dataframe with only two columns, but for a data frame with many columns, or if you are working with many dataframes, using the clean_names function could save a lot of time. "
"This may not seem that useful for a dataframe with only two columns, but for a data frame with many columns, or if you are working with many dataframes, using the `clean_names` function could save a lot of time. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exploring the case parameter: \n",
"The clean_names function also has an optional parameter, case, which specifics the capitalization structure of the output column names. The default value for this parameter is \"snake_case\" and the other options are \"CamelCase\" and \"lowerCamelCase\". snake_case uses only lowercase letters and spaces are replaced with underscores. \"CamelCase\" capitalizes the first letter of a name and every letter following a space. \"lowerCamelCase\" results in the first letter of the name being lowercase and the first letter following a space being capitalized. \n",
"The `clean_names` function also has an optional parameter, case, which specifics the capitalization structure of the output column names. The default value for this parameter is \"`snake_case`\" and the other options are \"`CamelCase`\" and \"`lowerCamelCase`\". snake_case uses only lowercase letters and spaces are replaced with underscores. \"`CamelCase`\" capitalizes the first letter of a name and every letter following a space. \"`lowerCamelCase`\" results in the first letter of the name being lowercase and the first letter following a space being capitalized. \n",
"Below are some examples using this optional parameter: "
]
},
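The example cells for the `case` options are collapsed in the diff. A short sketch of how the parameter might be passed, assuming a `case=` keyword argument; the import path and the expected output names (which follow the rules above) are assumptions:

```python
import pandas as pd
from datexplore import clean_names  # assumed import path

df = pd.DataFrame({"House Price": [300_000], "Bedrooms": [2]})

# Assumes the optional parameter is passed as case="...".
print(clean_names(df).columns.tolist())                         # expected: ['house_price', 'bedrooms']
print(clean_names(df, case="CamelCase").columns.tolist())       # expected: ['HousePrice', 'Bedrooms']
print(clean_names(df, case="lowerCamelCase").columns.tolist())  # expected: ['housePrice', 'bedrooms']
```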
@@ -866,7 +866,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In data analysis, outliers can either reveal critical insights or introduce annoying biases. The detect_outliers() function is an easy to use tool for detecting and categorizing outliers in Pandas Data Frames. This function uses the Interquartile Range (IQR) and standard deviation to identify and categorize its outliers. It then outputs the outliers to a Data Frame in a format that is simple to use and explore. By quickly identifying the most extreme outliers with our function, you can immediately get a sense of the scale of the problem the outliers might present. There are many real-world examples where disproportionate outliers make otherwise useful summary statistics unreliable. For example, detect_outliers() should be useful for real estate pricing. Data analysis on a real estate dataset can be compromised, when that dataset includes a few luxury homes priced significantly higher than the average. These extreme home values introduce a substantial skew, distorting the overall analysis. Our function enables you to swiftly identify and categorize these outliers. It also provides their index locations in the output Data Frame. Once you have the index of the outliers, all that's required is a few extra lines of code to remove these anomalous entries from your original dataset, ensuring a more balanced analysis!"
"In data analysis, outliers can either reveal critical insights or introduce annoying biases. The `detect_outliers()` function is an easy to use tool for detecting and categorizing outliers in Pandas Data Frames. This function uses the Interquartile Range (IQR) and standard deviation to identify and categorize its outliers. It then outputs the outliers to a Data Frame in a format that is simple to use and explore. By quickly identifying the most extreme outliers with our function, you can immediately get a sense of the scale of the problem the outliers might present. There are many real-world examples where disproportionate outliers make otherwise useful summary statistics unreliable. For example, `detect_outliers()` should be useful for real estate pricing. Data analysis on a real estate dataset can be compromised, when that dataset includes a few luxury homes priced significantly higher than the average. These extreme home values introduce a substantial skew, distorting the overall analysis. Our function enables you to swiftly identify and categorize these outliers. It also provides their index locations in the output Data Frame. Once you have the index of the outliers, all that's required is a few extra lines of code to remove these anomalous entries from your original dataset, ensuring a more balanced analysis!"
]
},
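The notebook does not show the detection logic here; the IQR rule the paragraph refers to can be sketched in plain `pandas` as an illustration of the general method (this is not the package's actual implementation, and the 1.5 × IQR cutoff is an assumed convention):

```python
import pandas as pd

def iqr_bounds(series: pd.Series) -> tuple[float, float]:
    """Illustration of the common 1.5 * IQR rule (assumed convention)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

prices = pd.Series([310_000, 295_000, 305_000, 320_000, 4_800_000])
low, high = iqr_bounds(prices)
print(prices[(prices < low) | (prices > high)])  # flags the luxury-home price
```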
{
@@ -875,7 +875,7 @@
"source": [
"#### Usage example\n",
"\n",
"To give a simple demonstration on how the function is used, let's create a sample toy Data Frame (we can imagine the columns are features in the housing data set referenced above)."
"To give a simple demonstration on how the function is used, let's create a sample toy Data Frame (we can imagine the columns are features in a housing data set referenced above)."
]
},
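The cell that builds the toy data and calls the function is collapsed in the diff. A sketch of what it could contain, with illustrative column names and an assumed call signature of `detect_outliers(df)`:

```python
import pandas as pd
from datexplore import detect_outliers  # assumed import path

# Illustrative toy data: mostly typical values plus one extreme price.
df = pd.DataFrame({
    "price":   [310_000, 295_000, 305_000, 320_000, 4_800_000],
    "sq_feet": [1_400, 1_350, 1_500, 1_450, 1_480],
})

# Assumed signature: takes a dataframe, returns a Data Frame of outliers.
outliers = detect_outliers(df)
print(outliers)
```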
{
@@ -926,11 +926,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The detect_outliers output returns the Data Frame shown above. A simple print function call will output the data in a clean, easy to read format.\n",
"The column name of the outlier, it's index, the value, it's deviation and a categorical description of how large the outlier is.\n",
"The outlier information can be used with pandas as needed to transform the original data frame.\n",
"The `detect_outliers` output returns the Data Frame shown above. A simple `print` function call will output the data in a clean, easy to read format.\n",
"The `column` name of the outlier, it's `index`, the `outlier_value`, it's `deviation` and a `category` description of how large the outlier is.\n",
"The outlier information can be used with `pandas` as needed to transform the original data frame.\n",
"\n",
"For example, to remove the outliers from the original Data Frame using the output from our detect_outliers function you would."
"For example, to remove the outliers from the original Data Frame using the output from our `detect_outliers` function you would:"
]
},
{
@@ -967,20 +967,6 @@
"df_outlier_free = df.drop(extreme_indices)\n",
"print(df_outlier_free)"
]
},
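The lines that define `extreme_indices` are collapsed above. One plausible way to build them from the output described earlier (continuing from the `outliers` sketch above; the `index` and `category` column names and the "Extreme Outlier" label are assumptions):

```python
# Sketch only: assumes the outlier Data Frame has 'index' and 'category'
# columns and labels its most severe rows "Extreme Outlier".
extreme_indices = outliers.loc[
    outliers["category"] == "Extreme Outlier", "index"
].tolist()
```

The `df.drop(extreme_indices)` call shown above then removes those rows from the original dataframe.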
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {