jupytext | kernelspec | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
Like it or loathe it, Microsoft's Excel spreadsheet programme is ubiquitous in our offices and homes. If you're looking to get off Excel, look no further because code can do everything Excel can-and a whole lot more!
The biggest difference between a fully-fledged programming language, such as Python, and a graphical tool for spreadsheet analysis is that you write out all of your instructions in code in the former, but only some instructions are written in code in the latter (such as cell formulae). Another big difference is that Python (and related tools like Visual Studio Code and git) are free!
The three main packages (extensions of the basic Python programming language) that cover Excel's functionality are pandas, which is for panel data analysis (such as spreadsheets). For some Excel functions, numpy, which provides tools for working with numbers, and matplotlib, which creates charts, will be necessary.
In this chapter, we'll focus on how to go from Excel to Python by replacing Excel. The situation where you want to work with Excel from Python (eg by populating a spreadsheet with data, formulae, and charts) will be covered in another chapter on automation. Your first port of call for replacing Excel should be the chapter on {ref}data-spreadsheets
. There is a more in-depth walkthrough of replacing Excel with Python here.
To get started with Excel, you open the application and create a new empty file. To get started with Python, follow the instructions in the preliminaries chapter of this book and create a new script (a file ending in .py
) in Visual Studio Code. You will also need to have the Python package pandas installed. The translations of common Excel tasks in the table below assume you have already run import pandas as pd
, and that you have your data in a pandas dataframe called df
(see the parts of this book on data analysis for more on using pandas). Although the table only covers a small fraction of what Excel does, and is certainly not comprehensive, it should give an idea of the ease and flexibility of pandas in replacing common Excel functions.
Action | Excel | Python (pandas) |
---|---|---|
View data | Open the file using Excel | df.head(n) to see the first n rows; use df to see all |
Count rows | Select a column with the mouse and look at the count indicator | df.count() |
Find Sum, Average, Max, or Min of the 1st column | eg, =SUM(A:A) |
eg, df.iloc[:, 0].sum() . Other commands are .mean() , .max() , and .min() |
Perform an operation on, eg sum, on the first row | =SUM(1:1) |
df.iloc[0, :].sum() |
Count occurrences of distinct items within a column | See here (complicated!) | df['column_name'].value_counts() |
Return true or false given values in another column based on a condition | =IF(A1<100, TRUE, FALSE) etc, with 1 changing according to row |
df['new_column'] = df['column_name'] < 100 |
Adding two columns | = A1 + B1 etc, with 1 changing according to row |
df['new_column'] = df['column_A'] + df['column_B'] |
Create a date column | Click on the column; select "Date" from list of number-format categories. | df['column'] = pd.to_datetime(df['column']) |
Count days between two dates | =DAYS(A1, B1) with 1 changing according to row |
df['new_column'] = df['column_A'] - df['column_B'] , where both existing columns are datetimes |
Create a cumulative sum on a column | =SUM(A$1:A10) where 10 is the maximum row index number |
df['new_columns'] = df['column'].cumsum() |
Concatenate string columns | =CONCATENATE(A1,B1) where 1 changes by row |
df['new_column'] = df['string_column_a'] + df['string_column_b'] |
Find out which cells in a column have invalid entries | =ISERROR(A1) where 1 changes by row |
df['column'].notna() |
Remove excess whitespace | =TRIM(A1) where 1 changes by row |
df['new_column'] = df['column'].replace("\s+", " ", regex=True) |
Transpose data | =TRANSPOSE(A1:C100) , assuming a 3 by 100 input data set between columns A and C, and rows 1 to 100 |
df.T |
Create a pivot table | Manually select data and use the various menus (example here) | pd.pivot_table(df, values='D', index=['A'], columns=['C'], aggfunc=np.sum) by summing over a column named 'D' with rows 'A', creating columns 'C' |
For more on how to replace Excel with Python and pandas, see the Data Analysis Quickstart and Data Analysis sections.