
Repository for analysis and experiments in the BigCode project.


hughesthe1st/bigcode-analysis

 
 


BigCode Analysis

This repository contains the analysis done in the BigCode project. It covers datasets, models, architecture choices and more.

Contents

  • Data analysis: In the data_analysis folder, we analyze two datasets: python-all-license (private) and python-safe-license. We provide the following statistics:
    • percentage of near duplicates
    • percentage of configuration/test and uncommon files
    • file size distribution
    • loss analysis
    • natural language distribution in comments/docstrings and number of files that can be successfully compiled

We also provide code to run near-deduplication and to detect the natural language of comments in Python datasets.

  • Multi-query attention experiments: for details, refer here
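As a rough illustration of two of the data-analysis statistics listed above (the number of files that compile and the percentage of near duplicates), here is a minimal sketch using only the Python standard library. It is not the code in data_analysis, and this README does not describe how that code works. The function names, the 3-token shingles and the 0.85 similarity threshold are all assumptions made for the example.

```python
import ast
import re


def compiles(source: str) -> bool:
    """Return True if the source parses as valid Python for the running interpreter."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def shingles(source: str, k: int = 3) -> set:
    """Set of k-token shingles; tokens are runs of word characters (an assumption)."""
    tokens = re.findall(r"\w+", source)
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicate_rate(files, threshold: float = 0.85) -> float:
    """Fraction of files whose similarity to an earlier kept file is >= threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    kept = []
    duplicates = 0
    for source in files:
        current = shingles(source)
        if any(jaccard(current, previous) >= threshold for previous in kept):
            duplicates += 1
        else:
            kept.append(current)
    return duplicates / len(files) if files else 0.0


if __name__ == "__main__":
    sample = [
        "def f(x):\n    return x + 1\n",
        "def f(x):\n    return x + 1\n",
        "def g(:\n",
    ]
    print("compilable:", sum(compiles(s) for s in sample), "of", len(sample))
    print("near-duplicate rate:", near_duplicate_rate(sample))
```

The pairwise comparison here is quadratic in the number of files, so it only suits small samples. Large-scale near-deduplication usually relies on approximate methods such as MinHash with locality-sensitive hashing.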


Languages

  • Jupyter Notebook 96.5%
  • Python 3.4%
  • Shell 0.1%