“When I wrote this code, only God and I understood what I did. Now only God knows.”
If you relate to the above quote, I invite you to a bi-weekly series of “Clean code practices for data scientists.” seminars. As data scientists, there are a lot of design topics we can learn from the software development area. We will cover some of these topics in this seminar.
We will cover a very brief introduction to the topic and then we will work on a small project together. In these seminars, we will cover the topics to write better codes.
This seminar is for data scientists who want to improve their coding skills. It is not for beginners. It is for people who already know how to code and want to improve their coding skills. We will not cover the details of the topic. We will just mention the topic and then we will work on a small project together.
It is up to you to learn the details of the topic. I will provide you with the resources that I used to learn the topic.
Summery:
- A class is a user-defined blueprint (Interface is used in more advanced languages)
- In Python, classes are advanced dictionary
- Each class instance can have attributes attached to it for maintaining its state.
- Class instances can also have methods (defined by their class) for modifying their state.
- setter and getter methods can be used to act as a proxy for interacting with internal states of class
- str, and repr method can be used to prettify a print
- Inheritance helps to breakdown your code to hierarchy design
- dataclass is an awesome package to define data-oriented classes (vs. behavioral)
- you can freeze a class to achieve const behavior
- with the help of class, you can use JSON to store states
Summery:
- functional programming paradigm vs. object-oriented
- What is a decorator and how to use them
- Most useful decorators for DS
- Used
@singledispatch
,@singleton
,@lru_cache
decorator- introduction to
functools
package- Learned about closure and how to use them
Summery:
- We created a modules and imported it
- Package file structure is reviewd
- To define a package,
setup.py
file is used- We used
twine
to publish funsql module- We used
pip
to installfunsql
module
Summery:
- Design by contract is a programming paradigm that define the interface (inputs and output types) of a component.
- Test-driven development is a development process that relies on the test cases, then the code is improved so that the tests pass.
- You have to write the test before you write the code. For any new features, you have to write the test first, then write the code to pass the test.
assert
is build-in Python statement for writing test units, which is different fromraise
to handle on the fly errors.pytest
andunittest
modules, has more variation forassert
.- You may use
logging
module to record diagnostic information, while running the program. and define the level of severity and destination of the log seperatly for different part of the code, and change them for production and development without changing the code again.
Summary:
- Jupyter Notebooks is a valuable tool but not suitable for production code
- You can use
sys
andos
modules to pass arguments to the scriptInquirerPy
is an excelent library to prompt for user input- Add
__main__.py
to your module to make itself independent script- You may use
rich
library, along withbreakpoint()
to debug your code
Challange
- We have transactions for an e-commerce website, and we need to identify users' spending decrease (soft attrition problem). You can download data from: https://drive.google.com/file/d/1Q_ZU_IN-igiI16GKrz8FGqXgvLQmeEsD/view?usp=share_link.
- The transaction table has the following columns: userid, transactionid, transactiontime, numberofitemspurchased, costperitem
- Write your main.py file that gets input and output CSV file path as parameters
- Read the transaction table from the input file.
- Clean up data removing duplicate data and userid = -1
- Calculate the "aggregated_table" with each month's dollar value of transactions per user. (hint: what happens if a user does not have transactions in month X?)
- Add a new column, "soft_attrition", to aggregated_table, which is "1" if the total amount of transactions in the next three months is less than 25% of the past three months; otherwise, it is "0".
- The final output file should have the following columns: userid, month (in string format like 2021-10), soft_attrition
Two solution to this challenge is provided Here:
Even though you may already know these topics, I still think it might be a good idea to review and see some small tricks that help for better coding.