53 commits
527b77f
Update README.md
gprana May 22, 2019
ea8a655
Update README.md
gprana Sep 11, 2019
1b260d7
Update README.md
gprana Sep 11, 2019
3302723
Update README.md
gprana Sep 11, 2019
273470c
Update README.md
gprana Sep 11, 2019
8533523
Update README.md
gprana Sep 11, 2019
5307740
Update README.md
gprana Sep 11, 2019
b4fa84e
Update README.md
gprana Sep 11, 2019
fca9635
Update README.md
gprana Sep 11, 2019
40d40dc
Add .gitignore, updated scripts to replace sklearn.cross_validation w…
gprana Nov 15, 2019
090cc4c
Updated .gitignore
gprana Nov 15, 2019
3286bdd
Updated README.md
gprana Mar 16, 2020
d839721
Updated README.md
gprana Mar 16, 2020
c1e5191
Reorganized and updated scripts to avoid import error when the script…
gprana Mar 28, 2020
acc1ba3
Remove unneeded __init__.py
gprana Mar 28, 2020
3de5008
Update README.md
gprana Apr 5, 2020
0667af5
Fix issue related to encoding and change set_value() to loc()
gprana Aug 10, 2020
03e8417
Remove __pycache__
gprana Aug 11, 2020
82454ab
Add requirements.txt
gprana Aug 11, 2020
dfb09c7
Bump lxml from 4.5.2 to 4.6.2
dependabot[bot] Jan 7, 2021
306b1c5
Merge pull request #5 from gprana/dependabot/pip/lxml-4.6.2
gprana Jan 14, 2021
6afa82a
Bump markdown2 from 2.3.9 to 2.4.0
dependabot[bot] Jun 2, 2021
7c56193
Merge pull request #7 from gprana/dependabot/pip/markdown2-2.4.0
gprana Sep 10, 2021
badb080
Bump lxml from 4.6.2 to 4.6.3
dependabot[bot] Sep 10, 2021
174f482
Merge pull request #6 from gprana/dependabot/pip/lxml-4.6.3
gprana Nov 24, 2021
37aa37e
Bump nltk from 3.5 to 3.6.5
dependabot[bot] Nov 24, 2021
6ff8ac8
Bump lxml from 4.6.3 to 4.6.5
dependabot[bot] Dec 13, 2021
44a292c
Merge pull request #8 from gprana/dependabot/pip/nltk-3.6.5
gprana Jan 20, 2022
6858315
Merge pull request #9 from gprana/dependabot/pip/lxml-4.6.5
gprana Jan 20, 2022
b1ce618
Bump nltk from 3.6.5 to 3.6.6
dependabot[bot] Jan 20, 2022
19b03ba
Merge pull request #10 from gprana/dependabot/pip/nltk-3.6.6
gprana Jan 20, 2022
c52da46
Bump numpy from 1.19.1 to 1.21.0
dependabot[bot] Jan 20, 2022
9dc3d7f
Create LICENSE.md
gprana Jan 20, 2022
a7ca407
Update README.md to add license information
gprana Jan 20, 2022
5b6a5d6
Merge pull request #11 from gprana/dependabot/pip/numpy-1.21.0
gprana Jan 20, 2022
2e7bce9
Bump numpy from 1.21.0 to 1.22.0
dependabot[bot] Jun 22, 2022
7d0e7d3
Bump lxml from 4.6.5 to 4.9.1
dependabot[bot] Jul 6, 2022
d1dbd39
Merge pull request #12 from gprana/dependabot/pip/numpy-1.22.0
gprana Aug 19, 2022
f6aa84a
Merge pull request #13 from gprana/dependabot/pip/lxml-4.9.1
gprana Aug 19, 2022
c942d97
solving incompatibility
vmmelo Jan 8, 2023
871ed85
import script functions
vmmelo Jan 8, 2023
6f6c0a4
fix: config directory paths
vmmelo Jan 13, 2023
045e424
files
vmmelo Jan 13, 2023
52a8ba9
fix: cleaning files was removing necessary files
vmmelo Jan 13, 2023
124bb4b
fix: cleaning files was removing necessary files
vmmelo Jan 13, 2023
71263a2
save to s3: content type
vmmelo Jan 13, 2023
b4d0d02
main receive language as argument
vmmelo Jan 16, 2023
56627e5
add folders to git
vmmelo Jan 20, 2023
b919781
deleting files
vmmelo Jan 20, 2023
c02648b
empty file
vmmelo Jan 20, 2023
0b4fa30
test
vmmelo Jan 20, 2023
06b31bb
logger
vmmelo Feb 10, 2023
1cbfac3
loglevel
Feb 19, 2023
8 changes: 8 additions & 0 deletions .gitignore
@@ -0,0 +1,8 @@
log/*
!log/empty
model/*
script/__pycache__/*
script/helper/__pycache__/*
temp/*
!temp/empty
readmeclassifier/*
22 changes: 22 additions & 0 deletions LICENSE.md
@@ -0,0 +1,22 @@
MIT License

Copyright (c) 2018 Gede Artha Azriadi Prana, Christoph Treude, Ferdian Thung,
Thushari Atapattu, and David Lo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
66 changes: 45 additions & 21 deletions README.md
@@ -3,28 +3,52 @@
## What
This project contains the source code of GitHub README content classifier from the paper "Categorizing the Content of GitHub README Files" (Gede Artha Azriadi Prana, Christoph Treude, Ferdian Thung, Thushari Atapattu, David Lo), published in 2018 in Empirical Software Engineering. DOI: [10.1007/s10664-018-9660-3](https://link.springer.com/article/10.1007%2Fs10664-018-9660-3)

## Setup
This project is written in Python 3. It also uses SQLite to store intermediary data during processing. By default the database is `database/data.db`.

The code requires creation of some directories for logging and temporary file storage. Please create these prior to running the scripts:
1. `log/`
2. `temp/abstracted_markdown/`
3. `temp/abstracted_html/`
4. `temp/target_abstracted_markdown/`

If you want to train a model using the provided dataset to predict labels in a new file that's not in the set, you'll also need to create the following directories:

5. `model/`. Used by `classifier_train_model.py` to save the result of training. `classifier_classify_target.py` loads the model saved in this directory for classifying sections in user-provided README files.
6. `input/clf_target_readmes/`. The default place to store README files whose section labels are to be predicted.
7. `output/`. `classifier_classify_target.py` saves its resulting labels here.
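The directory setup above can be scripted instead of created by hand; a minimal sketch using `os.makedirs` (directory names taken from the lists above; the last three are only needed for classifying new files):

```python
import os

# Directories required for logging, temporary file storage, the trained
# model, classification input, and classification output (see lists above).
REQUIRED_DIRS = [
    "log",
    "temp/abstracted_markdown",
    "temp/abstracted_html",
    "temp/target_abstracted_markdown",
    "model",
    "input/clf_target_readmes",
    "output",
]

for d in REQUIRED_DIRS:
    # exist_ok avoids an error when a directory already exists
    os.makedirs(d, exist_ok=True)
```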

## How to Use
This project is written in Python 3.

### Cross-validation Experiments
1. Set up file paths in `config/config.cfg`. By default, CSV files listing the section titles and their labels are in `input/`. `dataset_1.csv` contains the section titles and labels for the development set, whereas `dataset_2.csv` contains the section titles and labels for the evaluation set. The README files corresponding to the CSV files are in `input/ReadMes/` directory.
2. Empty all database tables by running the script `script/loading/empty_all_tables.py`
3. Run `script/loading/load_section_dataset_25pct.py` to extract and load section overview (title text, labels) and content of development set into database.
4. Run `script/loading/load_section_dataset_75pct.py` to extract and load section overview (title text, labels) and content of evaluation set into database.
5. Run the `script/experiment/*` scripts as required. E.g. `script/experiment/classifier_75pct_tfidf.py` for the SVM version.

### Training Model on Existing Data and Classifying New Files
1. Run `script/classifier/load_combined_set_and_train_model` to extract and load contents and titles listed in combined development and evaluation sets (by default, defined as `dataset_combined.csv` in `config/config.cfg`) into the database.
2. Run `script/classifier/load_and_classify_target` to extract and load contents of the README files in the directory specified in `target_readme_file_dir` variable in `config/config.cfg`.
3. By default, the resulting section labels will be saved in `output/output_section_codes.csv`. Classifier will also identify which codes exist for each file, and which codes don't yet exist (i.e. potential for README expansion). This information will be saved in `output/output_file_codes.csv`

### Training Model on Existing Data and Classifying New Files (Partial Steps)
1. Run `script/loading/load_section_dataset_combined.py` to extract and load section overview (title text, labels) and content of combined development and evaluation sets (by default, defined as `dataset_combined.csv` in `config/config.cfg`) into the database.
2. Place the README files whose sections are to be classified in the directory specified in `target_readme_file_dir` variable in `config/config.cfg`.
3. Run `script/loading/load_target_section_data.py` to load the section heading and content data into database.
4. Run `script/classifier/classifier_train_model.py`. This script will train SVM model using combined dataset in `*combined` tables. The resulting model, TFIDF vectorizer, and matrix label binarizer will be saved in `model/` directory.
5. Run `script/classifier/classifier_classify_target.py`. This script will use the saved model, vectorizer, and binarizer to classify target README files in the directory specified in `target_readme_file_dir` variable in `config/config.cfg`.
6. By default, the resulting section labels will be saved in `output/output_section_codes.csv`. Classifier will also identify which codes exist for each file, and which codes don't yet exist (i.e. potential for README expansion). This information will be saved in `output/output_file_codes.csv`
The following sections describe three use cases and the steps to follow for each scenario. Before running each use case, empty the database using `empty_all_tables.py`.

### Use Case 1: Running Cross-validation Experiments
1. Set up file paths in `config/config.cfg`. By default, CSV files listing the section titles and their labels are in `input/`. `dataset_1.csv` contains the section titles and labels for the development set, whereas `dataset_2.csv` contains those for the evaluation set. The README files corresponding to the CSV files are in the `input/ReadMes/` directory.
2. Load the development (i.e. the 25% used to develop heuristics) and evaluation (i.e. the remaining 75%) datasets by running `script/load_dev_and_eval_datasets.py`.
3. Run the `script/experiment/*` scripts as required. E.g., to compare cross-validation results across different algorithms, run `script/experiment_classifier_validation.py`.
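The three steps above can be chained in a small driver script; a sketch, assuming the script paths listed above and guarding each step so the sketch degrades gracefully if a script is absent:

```python
import os
import subprocess
import sys

# Use Case 1 pipeline, in the order given above.
PIPELINE = [
    "script/loading/empty_all_tables.py",          # start from empty tables
    "script/load_dev_and_eval_datasets.py",        # load 25% dev + 75% eval sets
    "script/experiment_classifier_validation.py",  # cross-validation experiment
]

def run_pipeline(steps):
    """Run each step with the current Python interpreter, stopping on failure."""
    for step in steps:
        if not os.path.exists(step):
            print(f"skipping {step}: not found")
            continue
        # check=True raises CalledProcessError if a step exits non-zero
        subprocess.run([sys.executable, step], check=True)

run_pipeline(PIPELINE)
```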

### Use Case 2: Training Model on Existing Data and Classifying New Files
1. Run `script/load_combined_set_and_train_model` to extract and load the contents and titles listed in the combined development and evaluation sets into the database. By default, this script reads `dataset_combined.csv` for section headings and labels, and the README files in the `input/ReadMes/` directory for the section contents.
2. Download the new README file(s) whose sections are to be labeled into a directory.
3. Open the classifier's configuration file (`config/config.cfg`) and edit the `target_readme_file_dir` variable to point to the location of the README file(s) to be labeled.
4. Run `script/load_and_classify_target` to extract contents of the new README files, load the section contents, and perform classification.
5. By default, the resulting section labels are saved in `output/output_section_codes.csv`. The classifier also identifies which codes exist for each file and which codes don't yet exist (i.e. potential for README expansion); this information is saved in `output/output_file_codes.csv`.
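The `output_file_codes.csv` output can be post-processed with the standard `csv` module; a sketch parsing two rows in the format shown later in this diff (the sample text is embedded so the example is self-contained):

```python
import csv
import io

# Header and two rows taken from output/output_file_codes.csv in this PR.
SAMPLE = """local_readme_file,section_codes_in_file,codes_not_in_file
2dust.v2rayNG.md,"-,1,3,6","4,5,7,8"
ACRA.acra.md,"1,4,6","3,5,7,8"
"""

def codes_by_file(text):
    """Map each README file to the set of section codes it already contains."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["local_readme_file"]: set(row["section_codes_in_file"].split(","))
        for row in reader
    }

present = codes_by_file(SAMPLE)
```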

### Use Case 3: Training Model on Existing Data and Classifying New Files (More Detailed Breakdown)
Each script used in the previous section automates multiple steps of the workflow to keep usage simple. If you want a more detailed breakdown, e.g. to evaluate intermediate results after each step in the workflow, use the following steps.

#### Training Model Using Existing Data
1. Run `script/load_section_dataset_combined.py`. This script extracts and loads the section overview (title text, labels) from a CSV file containing the complete set of section headings and labels. In `config/config.cfg`, this CSV file is specified as `dataset_combined.csv` by default. The script also loads the section content of the associated README files. All these data are stored in database tables whose names end in `combined`.
2. Run `script/classifier_train_model.py`. This script trains an SVM model on the data in the `*combined` database tables. The resulting model, TFIDF vectorizer, and label binarizer are saved in the `model/` directory.
#### Loading New File
3. Download the new README file(s) whose sections are to be labeled into a directory.
4. Open the classifier's configuration file (`config/config.cfg`) and edit the `target_readme_file_dir` variable to point to the location of the README file(s) to be labeled.
5. Run `script/load_target_sections.py` to load the section heading and content data into database.
#### Classifying Sections in the New File
6. Run `script/classifier_classify_target.py`. This script uses the saved model, vectorizer, and binarizer to classify the target README files in the directory specified by the `target_readme_file_dir` variable in `config/config.cfg`.
7. By default, the resulting section labels are saved in `output/output_section_codes.csv`. The classifier also identifies which codes exist for each file and which codes don't yet exist (i.e. potential for README expansion); this information is saved in `output/output_file_codes.csv`.

## Notes
All scripts log output (such as F1 scores and execution times) into the `log/` directory. Preprocessed README files (with numbers, `mailto:` links, etc. abstracted out) are saved in the `temp/` directory. Patterns used for the heuristics are listed in `doc/Patterns.ods`.

## License
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT). Please refer to [LICENSE.md](LICENSE.md)
Empty file added __init__.py
Empty file.
30 changes: 16 additions & 14 deletions config/config.cfg
@@ -1,16 +1,18 @@
[DEFAULT]
db_filename = ../../database/data.db
section_overview_25pct_filename = ../../input/dataset_1.csv
section_overview_75pct_filename = ../../input/dataset_2.csv
section_overview_combined_filename = ../../input/dataset_combined.csv
readme_file_dir = ../../input/dev_and_eval_readmes/
temp_abstracted_html_file_dir = ../../temp/abstracted_html/
temp_abstracted_markdown_file_dir = ../../temp/abstracted_markdown/
db_filename = READMEClassifier/database/data.db
section_overview_25pct_filename = READMEClassifier/input/dataset_1.csv
section_overview_75pct_filename = READMEClassifier/input/dataset_2.csv
section_overview_combined_filename = READMEClassifier/input/dataset_combined.csv
rng_seed = 100
vectorizer_filename = ../../model/vectorizer.clf
binarizer_filename = ../../model/binarizer.clf
model_filename = ../../model/model.clf
target_readme_file_dir = ../../input/clf_target_readmes/
temp_target_abstracted_markdown_file_dir = ../../temp/target_abstracted_markdown/
output_section_code_filename = ../../output/output_section_codes.csv
output_file_codes_filename = ../../output/output_file_codes.csv
vectorizer_filename = READMEClassifier/model/vectorizer.clf
binarizer_filename = READMEClassifier/model/binarizer.clf
model_filename = READMEClassifier/model/model.clf
# For use by readme content extractor scripts
readme_file_dir = READMEClassifier/input/dev_and_eval_readmes/
target_readme_file_dir = READMEClassifier/input/clf_target_readmes/
temp_abstracted_html_file_dir = READMEClassifier/temp/abstracted_html/
temp_abstracted_markdown_file_dir = READMEClassifier/temp/abstracted_markdown/
temp_target_abstracted_markdown_file_dir = READMEClassifier/temp/target_abstracted_markdown/
# For use by classifier
output_section_code_filename = READMEClassifier/output/output_section_codes.csv
output_file_codes_filename = READMEClassifier/output/output_file_codes.csv
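The `[DEFAULT]` section above is plain `configparser` syntax; a minimal reading sketch, using a few of the keys shown above (parsed from an embedded string so the example is self-contained):

```python
import configparser

# A fragment of config/config.cfg as shown in this diff.
CFG_TEXT = """
[DEFAULT]
db_filename = READMEClassifier/database/data.db
rng_seed = 100
model_filename = READMEClassifier/model/model.clf
"""

config = configparser.ConfigParser()
config.read_string(CFG_TEXT)

db_filename = config["DEFAULT"]["db_filename"]
rng_seed = config["DEFAULT"].getint("rng_seed")  # typed accessor for integers
```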
Binary file modified database/data.db
Binary file not shown.
Empty file added log/empty
Empty file.
10 changes: 10 additions & 0 deletions logger.py
@@ -0,0 +1,10 @@
import logging
import os
DEBUG = os.getenv('DEBUG', 'false').lower() == 'true'
logger = logging.getLogger('READMEClassifier')

if DEBUG:
    logger.setLevel(logging.DEBUG)
    logger.warning('Running in debug mode')
else:
    logger.setLevel(logging.ERROR)
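Consuming the module above follows the standard `logging` pattern; a sketch of how a script might use the shared logger (the env-var logic from `logger.py` is inlined here so the example is self-contained):

```python
import logging
import os

# Mirror of logger.py above: the DEBUG env var toggles verbosity.
DEBUG = os.getenv('DEBUG', 'false').lower() == 'true'
logger = logging.getLogger('READMEClassifier')
logger.setLevel(logging.DEBUG if DEBUG else logging.ERROR)

# A script then logs through the shared logger:
logger.debug('detail only emitted when DEBUG=true')
logger.error('always emitted')
```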
101 changes: 101 additions & 0 deletions output/output_file_codes.csv
@@ -0,0 +1,101 @@
local_readme_file,section_codes_in_file,codes_not_in_file
2dust.v2rayNG.md,"-,1,3,6","4,5,7,8"
ACRA.acra.md,"1,4,6","3,5,7,8"
afollestad.material-dialogs.md,"-,1,3,4,6","5,7,8"
agrosner.DBFlow.md,"1,3,4,5,6,7",8
airbnb.mavericks.md,"1,3,6","4,5,7,8"
alibaba.p3c.md,"1,3,4,6","5,7,8"
android.architecture-components-samples.md,"1,5,6","3,4,7,8"
android.architecture-samples.md,",1,3,5,6,8","4,7"
android.camera-samples.md,",1","3,4,5,6,7,8"
android.compose-samples.md,"-,1,3,5,6","4,7,8"
android.nowinandroid.md,"1,3,5,6,8","4,7"
android.sunflower.md,",1,3,5,6,8","4,7"
android.topeka.md,"1,3,5,6,8","4,7"
android.uamp.md,"-,1,3,5,6,7","4,8"
android10.Android-CleanArchitecture-Kotlin.md,"-,1,3,5,6","4,7,8"
androidx.androidx.md,"1,3,5,6,7","4,8"
ankidroid.Anki-Android.md,"-,1,3,5,6,7","4,8"
AppIntro.AppIntro.md,",-,1,3,5,6,7","4,8"
arrow-kt.arrow.md,"-,1,3,4,5,6","7,8"
bannedbook.fanqiang.md,-,"1,3,4,5,6,7,8"
bennyhuo.Kotlin-Tutorials.md,-,"1,3,4,5,6,7,8"
cashapp.sqldelight.md,"-,3,5","1,4,6,7,8"
chrisbanes.cheesesquare.md,",3,5","1,4,6,7,8"
chrisbanes.tivi.md,",1,3,5,6,7","4,8"
coil-kt.coil.md,"3,5","1,4,6,7,8"
corda.corda.md,"1,5,6,7","3,4,8"
dbacinski.Design-Patterns-In-Kotlin.md,",-,1,3,4,6","5,7,8"
detekt.detekt.md,",1,3,5,6,7","4,8"
didi.booster.md,",-,1,3,5,6,7","4,8"
diogobernardino.williamchart.md,",-,1,3,5,6","4,7,8"
drakeet.MultiType.md,",1,3,5,8","4,6,7"
ethereum-lists.chains.md,",-,1,3,4,5,6","7,8"
facebook.facebook-android-sdk.md,",1,3,5,6,7","4,8"
gedoor.legado.md,"-,1,3,5,6","4,7,8"
google.accompanist.md,",-,1,3,4,5,6,7",8
google.flexbox-layout.md,",-,1,3,5,7","4,6,8"
google.iosched.md,",-,1,3,6","4,5,7,8"
Gurupreet.ComposeCookBook.md,",1,3,6,7","4,5,8"
hectorqin.reader.md,"-,4,6","1,3,5,7,8"
igorwojda.android-showcase.md,",1,3,5,6,7","4,8"
InsertKoinIO.koin.md,"-,1,3,5,6,7,8",4
intellij-rust.intellij-rust.md,",1,3,7","4,5,6,8"
iSoron.uhabits.md,"1,3,5,7","4,6,8"
izhangzhihao.intellij-rainbow-brackets.md,",-,1,3,5,6,7","4,8"
JakeWharton.RxBinding.md,"1,3,5,6","4,7,8"
JakeWharton.timber.md,"3,5,6","1,4,7,8"
javalin.javalin.md,",-,1,3,5,6","4,7,8"
JetBrains.compose-jb.md,"1,3,4,6","5,7,8"
JetBrains.Exposed.md,",-,1,3,4,5,6,7",8
JetBrains.ideavim.md,"1,3,5,6,7","4,8"
JetBrains.kotlin-native.md,6,"1,3,4,5,7,8"
JetBrains.kotlin.md,"-,1,3,5,6,7","4,8"
kickstarter.android-oss.md,"-,3,5,6,7","1,4,8"
kittinunf.fuel.md,",-,3,5,6","1,4,7,8"
kotest.kotest.md,"1,6","3,4,5,7,8"
Kotlin.anko.md,"1,3,6,7","4,5,8"
Kotlin.kotlinx.coroutines.md,",3,6,7","1,4,5,8"
Kotlin.kotlinx.serialization.md,"-,1,3,6","4,5,7,8"
KotlinBy.awesome-kotlin.md,",1,6","3,4,5,7,8"
Kr328.ClashForAndroid.md,",-,1,3,5,6","4,7,8"
ktorio.ktor.md,",-,1,3,5,6,7","4,8"
libre-tube.LibreTube.md,5,"1,3,4,6,7,8"
mamoe.mirai.md,"-,1,5,6","3,4,7,8"
mikepenz.Android-Iconics.md,",-,1,3,5,6,8","4,7"
mikepenz.MaterialDrawer.md,",-,1,3,5,6,8","4,7"
mockk.mockk.md,",-,1,3,4,5,6,8",7
moezbhatti.qksms.md,",1,3,5,6,7","4,8"
mozilla-mobile.fenix.md,"1,3,5,6,7","4,8"
muzei.muzei.md,"1,3,6","4,5,7,8"
nickbutcher.plaid.md,"-,1,3,5,6,8","4,7"
pinterest.ktlint.md,"1,5,6","3,4,7,8"
pppscn.SmsForwarder.md,"-,5,6","1,3,4,7,8"
ReactiveX.RxKotlin.md,",-,1,3,4,6,7","5,8"
SagerNet.SagerNet.md,",-,3,5,6","1,4,7,8"
Shabinder.SpotiFlyer.md,"-,3,5,7","1,4,6,8"
shadowsocks.shadowsocks-android.md,"-,1,3,5,6,7","4,8"
sharish.ShimmerRecyclerView.md,"-,1,3,5,6","4,7,8"
skydoves.android-developer-roadmap.md,",-,3,5,7","1,4,6,8"
skydoves.Pokedex.md,"-,1,3,5,6","4,7,8"
sourcerer-io.sourcerer-app.md,",-,1,3,5,6,7","4,8"
square.leakcanary.md,",-,1,5","3,4,6,7,8"
square.moshi.md,",1,3,5,6","4,7,8"
square.okhttp.md,"1,3,5,6","4,7,8"
square.okio.md,"1,5,6","3,4,7,8"
square.picasso.md,"1,3,5","4,6,7,8"
square.wire.md,"1,3,6","4,5,7,8"
ssseasonnn.RxDownload.md,",3,5","1,4,6,7,8"
tachiyomiorg.tachiyomi.md,",1,3,5,6,7","4,8"
Tamsiree.RxTool.md,"-,1,4,5","3,6,7,8"
Tapadoo.Alerter.md,"-,1,3,5,6,7","4,8"
TeamVanced.VancedManager.md,"-,1,3,6,7","4,5,8"
thundernest.k-9.md,",-,1,4,6,7","3,5,8"
Triple-T.gradle-play-publisher.md,",-,1,3,4,6","5,7,8"
uber.RIBs.md,"-,1,3,5,6","4,7,8"
wasabeef.recyclerview-animators.md,",-,1,3,5,6,7","4,8"
yairm210.Unciv.md,",-,1,3,4,5,6,7",8
Yalantis.Context-Menu.Android.md,"-,1,3,4,5,6","7,8"
YiiGuxing.TranslationPlugin.md,"-,1,3,5,6,8","4,7"
yujincheng08.BiliRoaming.md,"-,1,5,6","3,4,7,8"
zetbaitsu.Compressor.md,",-,1,3,5,6","4,7,8"