Repository dedicated to projects developed during the Master's in Data Science at the University of Colorado Boulder.
The Exame Nacional do Ensino Médio (ENEM) plays a pivotal role in shaping Brazil's education landscape, serving as a determinant for student admissions into numerous higher education institutions in Brazil and even abroad. The rich dataset that ENEM offers allows for a deep dive into the many factors that influence student performance and broader trends in education.
Our work centers on a meticulous analysis of the ENEM 2022 dataset, using supervised learning techniques to forecast various outcomes. The heart of the analysis splits into two critical pathways, sketched in code after the list below:
- Regression Analysis: Targeted at predicting quantifiable outcomes such as student scores, using regression techniques to uncover the underlying patterns and correlations in student performance.
- Classification Analysis: Aimed at categorizing students into different groups based on predetermined criteria, such as their likelihood of pursuing higher education or their performance tier.
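As a minimal sketch of both pathways, the snippet below trains one regressor and one classifier on placeholder features; the file name and column names (math_score, performance_band, and the three predictors) are illustrative assumptions, not the actual ENEM microdata fields.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, accuracy_score

# Hypothetical file and columns -- adjust to the real ENEM 2022 microdata.
df = pd.read_csv("enem_2022.csv")
X = df[["socioeconomic_index", "school_type", "study_hours"]]

# Regression pathway: predict a continuous score.
y_reg = df["math_score"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y_reg, random_state=42)
reg = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, reg.predict(X_te)))

# Classification pathway: predict a categorical outcome.
y_clf = df["performance_band"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y_clf, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
print("Accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```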
By weaving through the extensive data trail of ENEM 2022, we aim to unearth insights that could potentially shape educational policies, assist educational institutions in crafting tailored strategies, and help students in making informed decisions.
We welcome data enthusiasts, educational researchers, and policy makers to collaborate and enrich this project with diverse perspectives and expertise. Together, we can strive to make a substantial impact in the educational sphere through data-driven insights and analysis.
Feel free to fork the repository, open issues, and submit pull requests. Your contributions will be duly acknowledged, and we look forward to building a rich repository of analyses and insights centered around the ENEM 2022 dataset.
Link: ENEM 2022 Project
In this project, we undertake a detailed analysis of the BBC News dataset leveraging both unsupervised and supervised learning techniques. The objective is to unravel hidden patterns and extract meaningful insights from the news articles, categorizing them accurately into predefined groups to facilitate efficient information retrieval and enhance reader experience.
To foster a deeper understanding of the news classification process, we have structured the repository into the following sections:
- Exploratory Data Analysis (EDA): This section offers a deep dive into the dataset, showcasing detailed text analysis through tokenization, stemming, lemmatization, and visualization of word statistics to provide a comprehensive overview of the data at hand.
- Unsupervised Learning Models: Here, we delve into the strategies adopted for unsupervised learning, exploring matrix factorization techniques (a minimal NMF sketch follows this list) and setting forth the evaluation metrics used to gauge the performance of the built models.
- Supervised Learning Model Comparison: A segment dedicated to the comparative study of various supervised learning models, discussing their performance, data efficiency, and addressing potential overfitting issues.
- Limitations of sklearn’s Non-negative Matrix Factorization Library: In this part, we explore the limitations encountered while using the sklearn library, suggesting possible improvements and ways to overcome these challenges.
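To make the matrix-factorization step concrete, here is a minimal TF-IDF + NMF sketch on a toy corpus; the real pipeline runs on the BBC News articles, with n_components set to the number of news categories.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy corpus standing in for the BBC News articles.
docs = [
    "The government announced a new budget for public schools.",
    "The striker scored twice in the championship final.",
    "Markets rallied as the central bank cut interest rates.",
]

# TF-IDF representation followed by non-negative matrix factorization.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvda", random_state=0)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-term weights

# Top terms per discovered topic.
terms = tfidf.get_feature_names_out()
for k, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```

Each row of W gives a document's topic mixture, which can be matched against the true article categories to evaluate the factorization.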
We welcome contributions from data enthusiasts, researchers, and policy makers to enrich the project with diverse perspectives and expertise. Feel free to fork the repository, open issues, and submit pull requests. Your contributions will be duly acknowledged, and we look forward to building a rich repository of analyses and insights centered around the BBC News dataset.
The analysis itself follows two main pathways:
- Regression Analysis: Aimed at predicting quantifiable outcomes such as scores in various metrics, utilizing regression techniques to find underlying patterns and correlations in the data.
- Classification Analysis: This pathway focuses on categorizing news articles into different groups based on a set of predetermined criteria, enhancing the reader's ability to find articles aligned with their interests.
This project opens avenues for further research and development in the field of text classification, inviting contributions that can refine classification strategies and foster informed decision-making through data-driven insights derived from the analysis of news articles.
Link: BBC News Classification Project
Welcome to our project repository where we unravel the mysteries of single-cell perturbations in the dynamic field of biotechnology and data science. Leveraging a rich dataset from a Kaggle competition focused on open problems in single-cell perturbations, we aim to foster groundbreaking discoveries in cellular responses to small molecule drug perturbations.
Our project is centered around a novel dataset created using human peripheral blood mononuclear cells (PBMCs). This dataset, derived from a meticulous experiment involving 144 compounds from the LINCS Connectivity Map dataset, offers a rich multi-omic background, providing a fertile ground for establishing biological priors that elucidate the susceptibility of specific genes to perturbation responses in various biological contexts.
The primary goal of this endeavor is to develop predictive models capable of accurately forecasting cellular responses to small molecule drug perturbations. Our objectives are multi-faceted, encompassing:
- Data Exploration and Understanding: Through comprehensive EDA, we aim to unearth underlying patterns and grasp the biological context of the data.
- Model Development: Utilizing unsupervised learning techniques, primarily matrix factorization methods (sketched after this list), we aim to build predictive models and compare them against supervised learning models to discern the strengths and weaknesses of each approach.
- Performance Evaluation: This involves rigorous testing of the models to gauge their predictive accuracy and robustness.
- Improvement and Optimization: We are committed to continually refining our models, overcoming limitations, and enhancing their predictive accuracy.
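As an illustration of the matrix-factorization direction, the sketch below applies a truncated SVD to a synthetic cells-by-genes count matrix; the matrix dimensions, Poisson counts, and rank are placeholders for the actual Kaggle data and tuning.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for a cells-by-genes expression matrix.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(200, 500)).astype(float)

# Low-rank factorization X ~ U @ Vt, a common first step before
# relating latent factors to compounds and cell types.
svd = TruncatedSVD(n_components=10, random_state=0)
U = svd.fit_transform(X)   # per-cell latent factors
Vt = svd.components_       # per-gene loadings
X_hat = U @ Vt

rmse = np.sqrt(np.mean((X - X_hat) ** 2))
print(f"rank-10 reconstruction RMSE: {rmse:.3f}")
print("explained variance:", svd.explained_variance_ratio_.sum())
```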
This project sits at the confluence of data science and biotechnology, striving to advance medicine through predictive analysis of cellular responses to drug perturbations. It promises substantial learning for participants and a potential pathway to meaningful discoveries in the medical field.
Our project is structured into two pivotal phases:
- Exploration and Understanding: A deep dive into the dataset to understand the biological context and uncover underlying patterns through detailed EDA.
- Model Development and Evaluation: This phase is dedicated to the development of predictive models using unsupervised learning techniques, followed by a rigorous evaluation to ascertain their performance and potential areas for improvement.
To get started with the project, navigate through the repository to find detailed documentation on each phase of the project, including the methodologies employed, the results obtained, and the conclusions derived from the analysis.
We invite collaborators and enthusiasts to join us in this analytical journey to unlock the potential of unsupervised learning in the realm of single-cell perturbations analysis. Feel free to contribute, suggest improvements, and raise issues as we collectively work towards a deeper understanding of cellular responses to drug perturbations.
Link: Open Problems Single Cell Perturbations
Welcome to our project repository, where we delve into the realm of histopathologic cancer detection. Leveraging a comprehensive dataset from a Kaggle competition, we aim to make significant strides in the field of digital pathology and machine learning.
Our project revolves around a dataset modified from the PatchCamelyon (PCam) benchmark dataset. It consists of small patches of images taken from larger digital pathology scans: 220,025 training images and 57,458 test images, where the binary label (metastatic cancer present or not) is determined by a 32x32 pixel region at the center of each patch.
The primary goal of this initiative is to develop machine learning algorithms capable of identifying metastatic cancer in small patches of images. Our objectives include:
- Data Exploration and Understanding: Through thorough EDA, we intend to discover underlying patterns and understand the medical context of the data.
- Model Development: We are focusing on developing various machine learning models, including a baseline CNN, VGG-16, ResNet50, and InceptionV3, to evaluate their strengths and weaknesses on this particular problem (a transfer-learning sketch follows this list).
- Performance Evaluation: Rigorous testing of the models based on the area under the ROC curve, the evaluation metric for the Kaggle competition.
- Improvement and Optimization: We are committed to refining our models through hyperparameter tuning, data augmentation, and potentially ensemble methods to enhance their performance.
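As a sketch of the transfer-learning setup for one of these backbones, the snippet below freezes an ImageNet-pretrained VGG16 and adds a small binary head; the 96x96 input size matches the competition patches, while the head layers and hyperparameters are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Frozen VGG16 backbone plus a small trainable binary head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(96, 96, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # metastatic vs. not
])

# AUC matches the competition's ROC-based evaluation metric.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="roc_auc")])
model.summary()
```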
This project sits at the intersection of machine learning and digital pathology, aiming to advance medical diagnosis through predictive analysis of pathology scans. It offers participants a rich learning experience and opens doors to innovative solutions in cancer detection.
Our project is structured into two primary phases:
- Exploration and Understanding: A comprehensive analysis of the dataset to understand its medical relevance and discover underlying patterns.
- Model Development and Evaluation: This phase is dedicated to the construction and rigorous evaluation of machine learning models tailored to the specific challenges posed by histopathologic cancer detection.
To get involved with the project, navigate through the repository to find in-depth documentation on each phase, including methodologies, results, and conclusions.
We invite enthusiasts and collaborators to join us in this analytical journey to unlock the potential of machine learning in the field of histopathologic cancer detection. Feel free to contribute, suggest improvements, and raise issues as we collaboratively work towards more effective and efficient cancer detection solutions.
Link: Histopathologic Cancer Detection
Welcome to the Disaster Tweet Classification Project repository. This project aims to make a significant impact in the field of Natural Language Processing by tackling the problem of classifying tweets related to real-world disasters. The data for this initiative is derived from a Kaggle competition and serves as an excellent starting point for those interested in NLP.
This project revolves around a dataset comprising 7,613 training tweets and 3,263 test tweets. These tweets are annotated with various features like ID, text content, geographical location, and keyword. The main objective is to classify whether a tweet is related to a real disaster or not.
The primary goal of this project is to:
- Conduct extensive Exploratory Data Analysis (EDA) to understand the underlying patterns and contexts within the tweets.
- Develop machine learning models, focusing on NLP techniques, to classify tweets effectively (a baseline sketch follows this list).
- Evaluate the performance rigorously based on the F1 score, which is the evaluation metric for the Kaggle competition.
- Continuously refine and optimize the model through techniques like hyperparameter tuning.
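A common baseline for this task is TF-IDF features feeding a linear classifier; the sketch below shows that pipeline on a few made-up tweets, scored with the competition's F1 metric.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny illustrative sample; the project trains on the 7,613 labeled tweets.
tweets = [
    "Forest fire near the town, residents being evacuated",
    "I love the smell of fresh coffee in the morning",
    "Flood warnings issued after the river burst its banks",
    "My fantasy football team is a total disaster lol",
]
labels = [1, 0, 1, 0]  # 1 = real disaster, 0 = not

X_tr, X_te, y_tr, y_te = train_test_split(
    tweets, labels, test_size=0.5, random_state=0, stratify=labels)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# F1 is the competition's evaluation metric.
print("F1:", f1_score(y_te, model.predict(X_te)))
```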
The significance of this project lies at the intersection of Natural Language Processing and crisis management. It aims to automate the process of identifying urgent tweets, thus potentially aiding disaster relief organizations and news agencies.
Our project is structured into two primary phases:
- Exploration and Understanding: In-depth analysis and understanding of the dataset, its features, and the challenge it poses.
- Model Development and Evaluation: This phase is dedicated to the construction, training, and evaluation of machine learning models suited for this particular NLP challenge.
To dive into this project, navigate through the repository to find comprehensive documentation on each phase, including methodologies, results, and conclusions.
We invite contributors and enthusiasts to join us in this analytical journey. Feel free to contribute, suggest improvements, and raise issues as we collaboratively work towards a more effective disaster tweet classification system.
Link: Disaster Tweet Classification Project
Welcome to the Monet-Style Image Generation Project. This initiative bridges the gap between art and technology, utilizing Generative Adversarial Networks (GANs) to recreate the distinctive style of Claude Monet. Derived from a Kaggle competition, this project serves as an ideal springboard for those interested in the convergence of machine learning and art.
Our challenge revolves around creating a GAN capable of generating 7,000 to 10,000 images mirroring the style of Monet. This endeavor not only tests the limits of computer vision and generative modeling but also explores the intriguing domain where data science meets art.
The main goals of this project are to:
- Develop and train a GAN that can successfully mimic Monet's artistic style (a structural sketch of a generator-discriminator pair follows this list).
- Conduct thorough evaluations using the Memorization-informed Fréchet Inception Distance (MiFID) metric to ensure the quality and originality of the generated images.
- Explore the creative capacities of GANs in transcending traditional boundaries of art reproduction.
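For orientation, here is a minimal DCGAN-style generator and discriminator. This is a structural sketch only: entries in this competition commonly use image-to-image architectures such as CycleGAN instead, and the 256x256 output size and layer widths are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_generator(latent_dim=128):
    # Upsample a latent vector to a 256x256 RGB image.
    return models.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(16 * 16 * 256),
        layers.Reshape((16, 16, 256)),
        layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh"),
    ])

def build_discriminator():
    # Score an image as real (Monet) or generated.
    return models.Sequential([
        layers.Input(shape=(256, 256, 3)),
        layers.Conv2D(64, 4, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Conv2D(128, 4, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Flatten(),
        layers.Dense(1),  # real/fake logit
    ])

print(build_generator().output_shape)      # (None, 256, 256, 3)
print(build_discriminator().output_shape)  # (None, 1)
```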
This project stands at the forefront of artistic innovation, demonstrating the potential of GANs in creating art. It’s a testament to how far the field of computer vision has evolved, showcasing the ability of algorithms to not just replicate but creatively contribute to the world of art.
Our project is structured into several key phases:
- Data Preparation: Understanding and processing the datasets of Monet paintings and photos.
- Model Development: Designing and training the generator and discriminator models within the GAN.
- Evaluation and Refinement: Rigorously evaluating the generated images using MiFID and refining the model for better performance.
To participate in this project, you can find detailed documentation on model architecture, training procedures, and evaluation methods within our repository. We provide resources and guidance every step of the way.
We welcome contributions from enthusiasts, artists, and data scientists alike. Your insights, improvements, and discussions are invaluable as we push the boundaries of what's possible in the fusion of art and machine learning.
Link: Monet-Style Image Generation with GANs
Welcome to the Invasive Species Detection Project using Computer Vision Techniques. This project aims to apply current computer vision and machine learning techniques to significant ecological challenges, specifically the monitoring of invasive species such as hydrangea.
The presence of invasive species like kudzu in Georgia and cane toads in over a dozen countries poses a substantial threat to the environment. Effective tracking of these species is essential, yet current methods are costly and inefficient due to the vast area that needs to be covered.
The main goal of this project is to develop computer vision algorithms that can accurately identify the presence of invasive species in images of forests and foliage, making monitoring more affordable and reliable.
This project highlights the potential of computer vision in contributing to ecological problem solutions, demonstrating how algorithms can assist in environmental conservation initiatives.
The project is divided into several key phases:
- Data Preparation: Processing of relevant image datasets.
- Model Development: Using machine learning techniques to train models capable of identifying invasive species.
- Evaluation and Refinement: Rigorous model evaluation using the AUC-ROC metric (illustrated in the sketch after this list) and continuous refinement for better performance.
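As a reference for the evaluation step, the snippet below computes AUC-ROC from ground-truth labels and model scores; the arrays are made-up values standing in for a validation split and classifier outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up validation labels (1 = invasive species present) and scores.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6, 0.3])

print(f"AUC-ROC: {roc_auc_score(y_true, y_score):.3f}")
```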
You will find detailed documentation on model architecture, training procedures, and evaluation methods in our repository. We provide resources and guidance every step of the way.
We invite enthusiasts, ecologists, and data scientists to contribute insights, improvements, and discussions. Your contributions are valuable as we explore the limits of what's possible in the fusion of technology and environmental conservation.
Link: Invasive Species Detection with Computer Vision Techniques
This project conducts a detailed descriptive analysis of the New York City Shooting Incident dataset. We use analytical methods to extract insights and identify patterns in the shooting records. The goal is to present the findings during the third week of the Master's in Data Science course at the University of Colorado Boulder.
The repository is divided into specific sections to ensure a comprehensive understanding of the analysis process:
- About the Dataset and Project: Detailed description of the origin and structure of the dataset, including information on how it is updated and maintained.
- Dataset Description: Exploration of the metadata to provide a clear summary of each column, helping to understand the variables available for analysis.
- Importing, Cleaning, and Organizing: Processes of data importation, handling missing values, adjusting data types, and removing irrelevant columns for analysis.
- Visualizations and Analysis: Data aggregation and creation of visualizations to answer the preliminary project questions, including georeferenced and temporal analyses of the incidents (see the aggregation sketch after this list).
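The sketch below shows the kind of temporal and spatial aggregation involved; the column names (OCCUR_DATE, BORO) follow the public NYPD dataset, but should be verified against the downloaded file.

```python
import pandas as pd

# Load the public dataset and parse dates (format assumed MM/DD/YYYY).
df = pd.read_csv("NYPD_Shooting_Incident_Data__Historic_.csv")
df["OCCUR_DATE"] = pd.to_datetime(df["OCCUR_DATE"], format="%m/%d/%Y")

# Temporal view: incidents per year.
per_year = df.groupby(df["OCCUR_DATE"].dt.year).size()
print(per_year.tail())

# Spatial view: incidents per borough.
print(df["BORO"].value_counts())
```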
We encourage the participation of data scientists, criminologists, and policy makers to enrich this project. Contributions can be made by forking the repository, opening issues, and submitting pull requests. We value all contributions and are excited to collaborate on developing a robust analytical tool.
The project focuses on two main objectives:
- Pattern Discovery: Identification of patterns and relationships in the data that may guide crime prevention strategies.
- Considerations for Predictive Modeling: Although we have not developed a predictive model, we identified the STATISTICAL_MURDER_FLAG variable, which could serve as a response variable in future predictive analyses (a hypothetical sketch follows this list).
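One hypothetical shape for that future step is sketched below: a logistic regression with STATISTICAL_MURDER_FLAG as the response. The predictor columns are plausible fields from the public dataset chosen for illustration, not the project's actual model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("NYPD_Shooting_Incident_Data__Historic_.csv")

# The flag typically parses as boolean; verify the encoding in the file.
y = df["STATISTICAL_MURDER_FLAG"].astype(bool)

# Illustrative predictors, one-hot encoded (missing values become "nan").
X = pd.get_dummies(df[["BORO", "VIC_AGE_GROUP", "VIC_SEX"]].astype(str))

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
print("cross-validated AUC:", scores.mean().round(3))
```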
This project provides a foundation for future research in public safety and criminal analysis. We look forward to subsequent advancements that can refine predictive modeling techniques, contributing to urban safety through well-informed policies and strategic planning.
For more information and access to the dataset, visit the dataset link: NYPD Shooting Incident Data (Historic).
Link: New York City Shooting Incident Data Analysis.
This academic project conducts an in-depth analysis of COVID-19 data, focusing on daily records of confirmed cases and virus-related deaths. The data, collected and consolidated from various global sources, are crucial for understanding the pandemic's impact, particularly in countries with populations larger than Brazil's. This project is part of the Master's in Data Science program at the University of Colorado Boulder and aims to present findings in the third week of the course.
The repository is organized into specific sections to facilitate a comprehensive understanding of the analysis process:
- About the Dataset and Project: Detailed description of the dataset's origin and structure, including information on updates and maintenance.
- Dataset Description: Exploration of the metadata to provide a clear summary of each column, aiding in understanding the variables available for analysis.
- Importing, Cleaning, and Organizing: Processes of data importation, handling missing values, adjusting data types, and removing irrelevant columns for analysis.
- Visualizations and Analysis: Data aggregation and creation of visualizations to answer preliminary project questions, including georeferenced and temporal analyses of the incidents.
We encourage participation from data scientists, epidemiologists, and policy makers to enrich this project. Contributions can be made by forking the repository, opening issues, and submitting pull requests. We value all contributions and are excited to collaborate on developing a robust analytical tool.
The project focuses on two main objectives:
- Pattern Discovery: Identification of patterns and relationships in the data that may guide public health strategies.
- Considerations for Predictive Modeling: While a predictive model has not been developed yet, the analysis could provide insights for future modeling efforts.
This project lays the groundwork for future research in public health and epidemiology. We look forward to subsequent advancements that can refine predictive modeling techniques, contributing to global health safety through well-informed policies and strategic planning.
The datasets used in this study include several key variables, such as:
- Countries: The name of the country.
- Lat and Long: Geographic coordinates for each location.
- Date: The date of the recorded data.
- Cases: Number of confirmed cases.
- Deaths: Number of confirmed deaths.
The data for this study were extracted from the following sources:
- Daily data on confirmed cases and deaths: Various global repositories.
- Country population reference: Relevant global datasets.
This project employs a structured methodology to analyze COVID-19 data, focusing on the following steps:
- Sample: Selection of countries with populations larger than Brazil's to ensure a representative demographic scale (illustrated in the sketch after this list).
- Explore: Examination of data through visualizations and descriptive statistics to identify trends and anomalies.
- Modify: Data wrangling to standardize datasets for accurate analysis, including handling missing values and data inconsistencies.
- Model: Application of statistical and machine learning models to estimate trends and predict future scenarios.
- Assess: Evaluation of model accuracy and reliability through cross-validation and analysis of results to determine public health implications.
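The sketch below illustrates the Sample and Explore steps under assumed file names and a long-format layout (Country, Date, Cases); the actual consolidated sources may be laid out differently.

```python
import pandas as pd

# Assumed inputs: a long-format cumulative case series and a population table.
cases = pd.read_csv("confirmed_cases.csv")   # columns: Country, Date, Cases
population = pd.read_csv("population.csv")   # columns: Country, Population

# Sample: keep only countries more populous than Brazil.
brazil_pop = population.loc[population["Country"] == "Brazil", "Population"].iloc[0]
large = population.loc[population["Population"] > brazil_pop, "Country"]
sample = cases[cases["Country"].isin(large)].copy()

# Explore: daily new cases per country from the cumulative series.
sample["Date"] = pd.to_datetime(sample["Date"])
sample = sample.sort_values(["Country", "Date"])
sample["NewCases"] = sample.groupby("Country")["Cases"].diff()
print(sample.groupby("Country")["NewCases"].describe())
```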
This study aims not only to understand but also to systematically document the patterns of dissemination and impact of the virus, contributing to future research and interventions in global public health.
For more information and access to the dataset, visit the dataset link: COVID-19 Data Repository.
Link: Analysis of the Impact of COVID-19
This academic project delves into the data science job market in the United Kingdom, focusing on key trends such as salary variation, required skills, and the influence of company ratings on job attractiveness. The dataset, sourced from Glassdoor, provides detailed information on job listings, including company names, job titles, salary ranges, and required skills. This project is part of the Master's in Data Science curriculum and is aligned with the objectives of exploring the trends in demand for data scientists, both regionally and skill-wise. The findings will be presented as part of the course's final project evaluation.
The repository is divided into distinct sections to enable a clear understanding of the project and its objectives:
- About the Dataset and Project: This section provides a thorough description of the dataset, including its origin, key attributes, and maintenance.
- Dataset Description: Explores the metadata to clarify the variables available for analysis, such as company ratings, job locations, salary ranges, and required skills.
- Data Cleaning and Organization: Details the processes of importing, cleaning, and organizing the dataset, including how missing values and data inconsistencies were addressed.
- Visualizations and Analysis: Displays a variety of visualizations and analyses that answer the key project questions, focusing on salary trends, skill requirements, and company ratings.
Contributions from the data science community are highly encouraged. Data scientists, HR professionals, and students are welcome to fork the repository, open issues, and submit pull requests to improve the project's insights. Collaboration is highly appreciated to enhance the understanding of the evolving data science job market in the UK.
This project focuses on the following main objectives:
- Analyze Salary Variation: Understand how salaries for data science roles vary across different cities and regions in the UK.
- Identify In-Demand Skills: Discover the most sought-after skills for data scientists and their impact on salaries.
- Assess Company Ratings and Remote Opportunities: Explore whether highly-rated companies offer higher salaries or more opportunities for remote work.
This project lays a strong foundation for future analysis of the data science job market, particularly as it evolves post-pandemic. The visualizations and insights could serve as a valuable resource for job seekers and employers alike. Future enhancements could involve expanding the dataset to include international job markets and further refining the predictive models for salary trends and skill demand.
The dataset used in this project contains several key variables, such as:
- Company: The name of the company offering the job position.
- Company Score: The rating given to the company by employees.
- Job Title: The specific job role being advertised.
- Date: The date the job listing was posted.
- Salary: The estimated salary range for the position (parsed into numeric bounds in the sketch after this list).
- Skills: A list of required skills for the job role.
- Estimation Type: Indicates whether the salary was estimated by the company or the job listing platform.
- Remote: Specifies whether the job is remote or on-site.
- City and Country: The location of the job.
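Glassdoor-style salary ranges arrive as strings, so a parsing step is useful before any salary analysis. The sketch below assumes a "£40K - £60K" format; the actual strings in the dataset may differ.

```python
import pandas as pd

# Two made-up rows in the assumed listing format.
df = pd.DataFrame({
    "Job Title": ["Data Scientist", "ML Engineer"],
    "Salary": ["£40K - £60K", "£55K - £75K"],
    "City": ["London", "Manchester"],
})

# Strip the currency symbol and K suffix, split the range, scale to GBP.
bounds = (df["Salary"]
          .str.replace("£", "", regex=False)
          .str.replace("K", "", regex=False)
          .str.split(" - ", expand=True)
          .astype(float) * 1000)
df["salary_low"], df["salary_high"] = bounds[0], bounds[1]
df["salary_mid"] = bounds.mean(axis=1)

print(df.groupby("City")["salary_mid"].mean())
```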
The data for this study were extracted from Glassdoor job listings, as described in the project overview above.
This project follows a structured methodology to analyze the UK job market for data scientists:
- Sample: Selection of job listings specific to the UK to reflect regional trends and market conditions.
- Explore: Utilize descriptive statistics and visualizations to identify trends in salary, skills, and company ratings.
- Modify: Clean and preprocess the dataset to ensure accurate and reliable analysis, including addressing missing values and formatting inconsistencies.
- Model: Implement data visualizations and predictive analyses to forecast salary trends and demand for specific skills.
- Assess: Evaluate the visualizations through user feedback and refinement to ensure clarity and usability.
This analysis aims to provide a comprehensive view of the job market for data scientists in the UK, guiding career planning and recruitment strategies.