Clean code practices for data scientists

“When I wrote this code, only God and I understood what I did. Now only God knows.”

If you relate to the quote above, I invite you to “Clean code practices for data scientists,” a bi-weekly seminar series. As data scientists, we can learn a lot about design from software development. These seminars cover some of those topics.

What is This?

Each seminar starts with a very brief introduction to the topic, and then we work on a small project together. The goal is to learn practices that help us write better code.

Target Audience

This seminar is for data scientists who already know how to code and want to improve their coding skills; it is not for beginners. We will not cover each topic in detail. We will just introduce it and then work on a small project together.

It is up to you to learn the details of each topic. I will share the resources that I used to learn it.

Covered Topics

1. Object-oriented Programming:

Summary:

A class is a user-defined blueprint for objects (some other languages also provide interfaces).

In Python, classes behave like advanced dictionaries.

Each class instance can have attributes attached to it for maintaining its state.

Class instances can also have methods (defined by their class) for modifying their state.

Setter and getter methods can act as a proxy for interacting with the internal state of a class.

The __str__ and __repr__ methods can be used to prettify printed output.

Inheritance helps you break your code down into a hierarchical design.

dataclasses is a handy standard-library module for defining data-oriented (as opposed to behavioral) classes.

You can freeze a dataclass to get const-like behavior.

With the help of classes, you can use JSON to store state.
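The points above can be sketched in a few lines. This is only an illustration, not code from the seminar notebooks: the Account and Point classes and their fields are hypothetical names chosen for the example.

```python
import json
from dataclasses import asdict, dataclass


class Account:
    """Behavioral class: state plus methods that modify it."""

    def __init__(self, owner, balance=0.0):
        self.owner = owner
        self._balance = balance  # leading underscore marks internal state

    @property
    def balance(self):
        # Getter: read access to the internal state.
        return self._balance

    @balance.setter
    def balance(self, value):
        # Setter: validate before changing the internal state.
        if value < 0:
            raise ValueError("balance cannot be negative")
        self._balance = value

    def __repr__(self):
        # Unambiguous form, used by the REPL and debuggers.
        return f"Account(owner={self.owner!r}, balance={self._balance!r})"

    def __str__(self):
        # Human-friendly form, used by print().
        return f"{self.owner}: {self._balance:.2f}"


@dataclass(frozen=True)
class Point:
    """Data-oriented class; frozen=True blocks attribute assignment."""

    x: float
    y: float


account = Account("Ada", 10.0)
account.balance = 25.5  # goes through the setter

point = Point(1.0, 2.0)
encoded = json.dumps(asdict(point))      # store the state as JSON text
restored = Point(**json.loads(encoded))  # rebuild an equal instance
```

Here vars(account) returns the instance's attribute dictionary, which is what the "advanced dictionary" remark refers to, and assigning to point.x raises dataclasses.FrozenInstanceError.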

2. Functional programming and decorators

Summary:

Functional programming paradigm vs. object-oriented programming

What a decorator is and how to use one

Most useful decorators for DS

We used the @singledispatch, @singleton, and @lru_cache decorators

Introduction to the functools package

We learned about closures and how to use them
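As a rough sketch of these ideas, the example below shows a closure, a hand-written decorator, and two decorators from functools. All function names are hypothetical and chosen for illustration. @singleton is left out because it is not part of functools and would have to be written by hand or taken from a third-party package.

```python
from functools import lru_cache, singledispatch, wraps


def make_scaler(factor):
    """Closure: scale() remembers `factor` after make_scaler returns."""
    def scale(value):
        return value * factor
    return scale


def count_calls(func):
    """Decorator: wraps func and counts how often it is called."""
    @wraps(func)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@lru_cache(maxsize=None)
def fib(n):
    """Cached recursion: each n is computed only once."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


@singledispatch
def describe(value):
    """Fallback used when no more specific type is registered."""
    return f"object: {value!r}"


@describe.register
def _(value: int):
    return f"integer: {value}"


@describe.register
def _(value: list):
    return f"list of {len(value)} items"


@count_calls
def square(x):
    return x * x


double = make_scaler(2)
```

Calling describe(3) dispatches to the int version, while describe("a") falls back to the generic one, so the type-specific logic lives in separate small functions instead of one long if/elif chain.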

3. Modules and packages

Summary:

We created a module and imported it

The package file structure is reviewed

A setup.py file is used to define a package

We used twine to publish the funsql module

We used pip to install the funsql module

4. Testing and logging

Summary:

Design by contract is a programming paradigm that defines the interface of a component explicitly: what it expects as input and what it guarantees as output.

Test-driven development is a development process in which the test cases are written first, and the code is then improved until the tests pass.

You write the test before you write the code. For any new feature, write the test first, then write the code that makes the test pass.

assert is a built-in Python statement for writing unit tests. It is different from raise, which is used to signal errors at run time.

The pytest and unittest modules offer more variations of assert.

You can use the logging module to record diagnostic information while the program is running. You can define the severity level and the destination of the log separately for different parts of the code, and change them between development and production without changing the code again.
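The difference between assert and raise, and a basic logging setup, might look like the sketch below. The mean function, the test names, and the "demo" logger name are assumptions made for this example. The tests are plain functions so that pytest could collect them, but they also run without it.

```python
import logging


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        # raise reports bad input at run time and is always active.
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def test_mean():
    # assert states what must hold; pytest collects functions named test_*.
    assert mean([1, 2, 3]) == 2
    assert mean([2.5]) == 2.5


def test_mean_rejects_empty_input():
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


def configure_logging(level=logging.INFO):
    """Set severity and destination in one place, away from the logic."""
    logger = logging.getLogger("demo")  # hypothetical logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.StreamHandler()  # destination: standard error
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logging(logging.DEBUG)  # development setting
logger.debug("mean of sample: %s", mean([1, 2, 3]))
test_mean()
test_mean_rejects_empty_input()
```

Switching to configure_logging(logging.WARNING) for production silences the debug message without touching mean. Note that assert statements are skipped when Python runs with the -O flag, which is one reason input validation uses raise instead.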

5. Executable Python and debugging

Summary:

Jupyter Notebook is a valuable tool, but it is not suitable for production code

You can use the sys and os modules to pass arguments to a script

InquirerPy is an excellent library for prompting for user input

Add __main__.py to your package to make it runnable as an independent script

You can use the rich library, along with breakpoint(), to debug your code
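A minimal sketch of a script entry point that reads its arguments from sys.argv and a setting from os.environ is shown below. The usage text and the DEMO_VERBOSE environment variable are hypothetical and exist only for this example.

```python
import os
import sys


def main(argv=None):
    """Entry point: argv defaults to the command-line arguments."""
    args = sys.argv[1:] if argv is None else list(argv)
    if len(args) != 2:
        print("usage: python main.py INPUT_CSV OUTPUT_CSV", file=sys.stderr)
        return 2  # non-zero exit code signals a usage error
    input_path, output_path = args
    # DEMO_VERBOSE is a hypothetical environment variable for this sketch.
    verbose = os.environ.get("DEMO_VERBOSE", "0") == "1"
    if verbose:
        print(f"reading {input_path}, writing {output_path}")
    # breakpoint()  # uncomment to pause here in the debugger
    return 0


# In a script, or in a package's __main__.py, the usual last lines are:
#
# if __name__ == "__main__":
#     sys.exit(main())
```

Keeping the logic in main(argv) rather than at module level means the same function can be called from a test with an explicit argument list, and the file can be imported without side effects.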

6. Data Manipulation Challenge

Challenge

We have transactions for an e-commerce website, and we need to identify users' spending decrease (soft attrition problem). You can download data from: https://drive.google.com/file/d/1Q_ZU_IN-igiI16GKrz8FGqXgvLQmeEsD/view?usp=share_link.

The transaction table has the following columns: userid, transactionid, transactiontime, numberofitemspurchased, costperitem

Write a main.py file that takes the input and output CSV file paths as parameters.

Read the transaction table from the input file.

Clean up the data by removing duplicate rows and rows with userid = -1.

Calculate the "aggregated_table" with each month's dollar value of transactions per user. (hint: what happens if a user does not have transactions in month X?)

Add a new column, "soft_attrition", to aggregated_table, which is "1" if the total amount of transactions in the next three months is less than 25% of the past three months; otherwise, it is "0".

The final output file should have the following columns: userid, month (in string format like 2021-10), soft_attrition
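To make the soft-attrition rule concrete, here is a small standard-library sketch of the aggregation and the flag. It is not one of the repository's solutions, and it rests on assumptions that the task leaves open: the "past three months" are taken to be the current month plus the two before it, the calendar runs from the first to the last month seen in the data, and months without full windows on both sides are skipped. Rows are assumed to be already parsed into dictionaries, and make_row is a hypothetical helper for the sample data.

```python
from collections import defaultdict
from datetime import datetime


def month_index(ts):
    """Map a timestamp to a running month number (year * 12 + month - 1)."""
    return ts.year * 12 + ts.month - 1


def month_label(index):
    """Inverse of month_index, formatted like 2021-10."""
    return f"{index // 12:04d}-{index % 12 + 1:02d}"


def monthly_totals(rows):
    """Sum items * cost per (userid, month); skip duplicates and userid -1."""
    totals = defaultdict(float)
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))  # whole row identifies a duplicate
        if key in seen or row["userid"] == -1:
            continue
        seen.add(key)
        month = month_index(row["transactiontime"])
        totals[(row["userid"], month)] += (
            row["numberofitemspurchased"] * row["costperitem"]
        )
    return totals


def soft_attrition(totals, window=3, threshold=0.25):
    """Return (userid, month, flag) for months with full windows (assumed)."""
    months = [m for _, m in totals]
    first, last = min(months), max(months)
    result = []
    for user in sorted({u for u, _ in totals}):
        # A month without transactions counts as 0 (the hint in the task).
        series = [totals.get((user, m), 0.0) for m in range(first, last + 1)]
        for i in range(window - 1, len(series) - window):
            past = sum(series[i - window + 1 : i + 1])
            future = sum(series[i + 1 : i + 1 + window])
            flag = 1 if future < threshold * past else 0
            result.append((user, month_label(first + i), flag))
    return result


def make_row(userid, transactionid, month, cost):
    """Hypothetical helper that builds one sample transaction in 2021."""
    return {
        "userid": userid,
        "transactionid": transactionid,
        "transactiontime": datetime(2021, month, 15),
        "numberofitemspurchased": 1,
        "costperitem": float(cost),
    }


rows = [make_row(1, m, m, cost) for m, cost in [(1, 100), (2, 100), (3, 100), (4, 10)]]
rows += [make_row(2, 10 + m, m, 50) for m in range(1, 7)]
rows.append(rows[0])                   # exact duplicate, ignored
rows.append(make_row(-1, 99, 7, 500))  # userid -1, ignored
flags = soft_attrition(monthly_totals(rows))
```

With this sample, user 1 spends 300 in January to March and only 10 in April to June, so March is flagged 1. User 2 spends steadily and gets 0. Filling missing months with zeros is what lets a user who stops buying entirely still show up as a decrease.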

Two solutions to this challenge are provided here:

The first one uses pandas, a very popular library for data manipulation, and the code is very concise.

The second one uses polars, a newer library for data manipulation that is very fast.

Conclusion

Even if you already know these topics, I think it is still a good idea to review them and pick up some small tricks that help you write better code.
