This project focuses on building a data processing module using Python, incorporating advanced features like generators, comprehensions, and object-oriented programming. The goal is to extract and analyze year-over-year revenue and revenue by region from Bamazon Inc.'s sales data spanning 2015 to 2021.
- Mastering Python's advanced data structures and functionalities.
- Efficiently handling and parsing large data files.
- Implementing robust testing strategies with pytest.
- Explored the use of generators for memory-efficient data processing.
- Learned the significance of the `__iter__` method in creating iterable objects.
- Utilized list comprehensions to simplify data manipulation and filtering tasks.
- Applied OOP principles to structure the data processing module, enhancing code modularity and reusability.
- Adopted pytest for writing and executing test cases, ensuring code reliability.
- Investigated techniques for processing large datasets without compromising performance.
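The generator, `__iter__`, and comprehension points above can be illustrated with a minimal sketch. It is not the module's actual code: the class name `SalesReader`, the helper `revenue_for_country`, and the column names `Date`, `Country`, and `TotalPrice` are assumptions chosen for illustration.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Iterator

# Hypothetical sample rows; the real dataset's columns may differ.
SAMPLE_ROWS = [
    {"Date": "2015-03-01", "Country": "China", "TotalPrice": "120.50"},
    {"Date": "2015-07-14", "Country": "India", "TotalPrice": "80.00"},
    {"Date": "2016-01-09", "Country": "China", "TotalPrice": "40.25"},
]


class SalesReader:
    """Iterable over the rows of a CSV file, yielding one row at a time."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def __iter__(self) -> Iterator[Dict[str, str]]:
        # Reopening the file on every call makes the object re-iterable,
        # and yielding keeps only a single row in memory at once.
        with open(self.path, newline="") as handle:
            for row in csv.DictReader(handle):
                yield row


def revenue_for_country(rows, country: str) -> float:
    """Sum TotalPrice for one country using a generator expression."""
    return sum(float(row["TotalPrice"]) for row in rows if row["Country"] == country)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    with open(sample_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["Date", "Country", "TotalPrice"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)

    reader = SalesReader(sample_path)
    china_revenue = revenue_for_country(reader, "China")
    # A second pass works because __iter__ starts a fresh generator.
    row_count = sum(1 for _ in reader)
```

Because `__iter__` is a generator function, the same `SalesReader` object can be looped over repeatedly without loading the whole file into a list.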
- Encountered issues with Python module discovery.
- Resolved by correctly setting the `PYTHONPATH` environment variable.
- Faced difficulties executing the main script due to directory structure.
- Overcame by adjusting the command line invocation to specify the correct module path.
- Identified and fixed syntax errors that were causing script failures.
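The module discovery issue above comes down to which directories the interpreter searches on import. The sketch below shows the effect of `PYTHONPATH` in isolation. The module name `demo_sales_helpers` and the helper `try_import` are hypothetical and exist only for this demonstration.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def try_import(module_name: str, cwd: str, pythonpath: str = "") -> bool:
    """Return True if a fresh interpreter started in cwd can import module_name."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start without any inherited PYTHONPATH
    if pythonpath:
        # Same effect as prefixing a command with PYTHONPATH=<dir>.
        env["PYTHONPATH"] = pythonpath
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        env=env,
        cwd=cwd,
        capture_output=True,
    )
    return result.returncode == 0


with tempfile.TemporaryDirectory() as pkg_dir, tempfile.TemporaryDirectory() as work_dir:
    # Hypothetical module living outside the working directory.
    (Path(pkg_dir) / "demo_sales_helpers.py").write_text("VALUE = 1\n")

    found_without = try_import("demo_sales_helpers", cwd=work_dir)
    found_with = try_import("demo_sales_helpers", cwd=work_dir, pythonpath=pkg_dir)
```

Directories listed in `PYTHONPATH` are added to `sys.path` at interpreter startup, which is why the second import succeeds while the first fails.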
Successfully developed and tested a Python module capable of:
- Generating insights from large datasets.
- Aggregating sales data to calculate revenue metrics.
- Producing reports including bar graphs and JSON files for data visualization.
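As a rough illustration of the aggregation and JSON reporting steps listed above, the following sketch computes revenue per year, revenue per region, and the year-over-year change. The function names and the `Date`, `Country`, and `TotalPrice` fields are assumptions, not the project's real schema, and the bar graph output is left out to keep the example within the standard library.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical rows; in the project these would come from the sales data files.
ROWS = [
    {"Date": "2015-02-10", "Country": "China", "TotalPrice": "100.0"},
    {"Date": "2015-11-23", "Country": "India", "TotalPrice": "50.0"},
    {"Date": "2016-05-04", "Country": "China", "TotalPrice": "200.0"},
]


def aggregate_revenue(rows: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Accumulate revenue per year and per region in a single pass."""
    per_year = defaultdict(float)
    per_region = defaultdict(float)
    for row in rows:
        amount = float(row["TotalPrice"])
        per_year[row["Date"][:4]] += amount  # assumes ISO dates (YYYY-MM-DD)
        per_region[row["Country"]] += amount
    return {"revenue_per_year": dict(per_year), "revenue_per_region": dict(per_region)}


def year_over_year_change(revenue_per_year: Dict[str, float]) -> Dict[str, float]:
    """Difference between each year's revenue and the previous year's."""
    years = sorted(revenue_per_year)
    return {
        current: revenue_per_year[current] - revenue_per_year[previous]
        for previous, current in zip(years, years[1:])
    }


report = aggregate_revenue(ROWS)
report["year_over_year_change"] = year_over_year_change(report["revenue_per_year"])

# Serialize the report; writing it to a file would use json.dump instead.
report_json = json.dumps(report, indent=2, sort_keys=True)
```

Since `aggregate_revenue` accepts any iterable, it can consume a generator of rows directly, so the full dataset never has to sit in memory.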
- To optimize Python code using multiprocessing, focusing on reducing execution time for data processing tasks.
- Understanding Multiprocessing: Grasped the basic concepts of parallel processing and Python's `multiprocessing` module.
- Code Optimization: Successfully refactored Week 1's single-threaded code to a parallelized version using `multiprocessing`.
- Performance Measurement: Measured and compared execution times to quantify performance improvements.
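- Global Interpreter Lock (GIL) Limitations: Worked around the GIL, which prevents threads in one CPython process from running Python bytecode in parallel, by using multiprocessing so that each worker process runs under its own interpreter.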
- Dynamic Process Allocation: Implemented dynamic process allocation based on the system's available CPU cores to optimize performance.
- Technical Precision: Achieved a significant reduction in execution time by parallelizing data processing tasks.
- Educational Growth: Enhanced understanding of multiprocessing and its application for optimizing Python code.
- Communication Clarity: Demonstrated clear and effective communication through inquiries and code modifications.
- Fine-tuning Multiprocessing: Explore the impact of chunk sizes and the difference between process vs. thread pools.
- Understanding Parallelism Patterns: Investigate various parallel computing patterns for structuring computations efficiently.
- The journey through optimizing Python code with multiprocessing has enriched both technical skills and understanding of parallel computing principles. There's a demonstrated ability to effectively apply multiprocessing for performance optimization, showing significant progress in mastering Python for high-performance applications.
- Explore "High Performance Python" by Micha Gorelick and Ian Ozsvald to deepen knowledge on optimizing Python code.
- Enroll in Advanced Python Courses focusing on performance optimization and parallel computing on platforms like Coursera or Udemy.
- Engage in Real-world Projects or contribute to open-source projects involving data processing or scientific computing to apply multiprocessing knowledge practically.
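The Week 2 notes above (process pools sized to the available CPU cores, chunking, and timing comparisons) can be sketched as follows. This is an illustrative example under stated assumptions, not the project's refactored code: `chunked`, `chunk_revenue`, and `parallel_revenue` are hypothetical helpers, and the workload is a synthetic list of amounts.

```python
import os
import time
from multiprocessing import Pool
from typing import List, Optional


def chunked(items: List[float], size: int) -> List[List[float]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def chunk_revenue(chunk: List[float]) -> float:
    """Work done by one worker; defined at module level so it can be pickled."""
    return sum(chunk)


def parallel_revenue(
    amounts: List[float],
    processes: Optional[int] = None,
    chunk_size: int = 10_000,
) -> float:
    """Sum amounts across a pool sized to the machine's CPU count by default."""
    processes = processes or os.cpu_count() or 1
    with Pool(processes=processes) as pool:
        partial_sums = pool.map(chunk_revenue, chunked(amounts, chunk_size))
    return sum(partial_sums)


if __name__ == "__main__":
    # Synthetic data standing in for parsed sales amounts.
    amounts = [float(i % 100) for i in range(200_000)]

    start = time.perf_counter()
    serial_total = chunk_revenue(amounts)
    serial_seconds = time.perf_counter() - start

    start = time.perf_counter()
    parallel_total = parallel_revenue(amounts)
    parallel_seconds = time.perf_counter() - start

    print(f"serial:   {serial_seconds:.4f}s")
    print(f"parallel: {parallel_seconds:.4f}s")
    print("totals match:", serial_total == parallel_total)
```

For a workload this small, the parallel version may well be slower, because starting processes and pickling chunks has a fixed cost. Varying `chunk_size` and `processes` is one way to explore the fine-tuning questions raised above.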
- Make sure you're using Python version >= 3.9.0
- Install all the modules
pip install -r requirements.txt
Generate test data (useful for unit testing code)
python generate_data.py --type tst
Generate small data (useful for quick testing of logic)
python generate_data.py --type sml
Generate big data (actual data)
python generate_data.py --type bg
PYTHONPATH=../ pytest test.py -s
PYTHONPATH=./w1 pytest w1/test.py -s
PYTHONPATH=../ pytest test.py
PYTHONPATH=./w1 pytest w1/test.py
PYTHONPATH=../ python main.py --type tst
PYTHONPATH=./w1 python -m w1.main --type tst
PYTHONPATH=../ python main.py --type sml
PYTHONPATH=./w1 python -m w1.main --type sml
PYTHONPATH=../ python main.py --type bg
PYTHONPATH=./w1 python -m w1.main --type bg
PYTHONPATH=../ pytest test.py -s
PYTHONPATH=./w2 pytest w2/test.py -s
PYTHONPATH=../ pytest test.py
PYTHONPATH=./w2 pytest w2/test.py
PYTHONPATH=../ python main.py --type tst
PYTHONPATH=./w2 python -m w2.main --type tst
PYTHONPATH=../ python main.py --type sml
PYTHONPATH=./w2 python -m w2.main --type sml
PYTHONPATH=../ python main.py --type bg
PYTHONPATH=./w2 python -m w2.main --type bg