In this notebook we explain how we connect to the Google Cloud BigQuery service to access its public datasets. We use the github_repos dataset, which contains data from over 3 million GitHub repositories.
Finally, we extract a sample of the BigQuery dataset and preprocess it.
You can find more information about the dataset on Kaggle and Google Cloud BigQuery.
The dataset originates from the Google Cloud BigQuery public data program.
Kaggle has an example notebook showing data extraction through SohierDane's Python library BigQueryHelper, which simplifies access to BigQuery, although you still need ADC credentials from the Google Cloud CLI.
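As a rough sketch of how that library is used (it assumes ADC credentials are already configured; the table and column names match the queries later in this notebook, and the 1 GB scan cap is an arbitrary safety value):

```python
# Sketch of data extraction with SohierDane's bq_helper (pip install bq-helper).
# Assumes ADC credentials are already set up on the machine.
from bq_helper import BigQueryHelper

# Point the helper at the public github_repos dataset.
github = BigQueryHelper(active_project="bigquery-public-data",
                        dataset_name="github_repos")

print(github.list_tables())  # e.g. commits, licenses, sample_commits, sample_repos

query = """
    SELECT repo_name, license
    FROM `bigquery-public-data.github_repos.licenses`
    LIMIT 10
"""
print(github.estimate_query_size(query))                    # estimated GB scanned
df = github.query_to_pandas_safe(query, max_gb_scanned=1)   # aborts if scan exceeds 1 GB
print(df.head())
```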
Download the Google Cloud CLI and create credentials to use BigQuery directly from Python with the Google Cloud Python SDK.
After setting up the credentials, you'll need to configure the Google Cloud client SDK and use an IAM account with a sufficient quota of bytes/requests for the API.
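A minimal sketch of that setup with the official google-cloud-bigquery client, assuming ADC credentials were created with the gcloud CLI and that `your-gcp-project-id` is a placeholder for your own billing project:

```python
# Minimal sketch using the Google Cloud Python SDK (pip install google-cloud-bigquery).
# Credentials are picked up from ADC, created beforehand with:
#   gcloud auth application-default login
from google.cloud import bigquery

# The project given here is the one billed and quota-tracked for the queries,
# not the project that owns the public dataset.
client = bigquery.Client(project="your-gcp-project-id")  # placeholder project id

query = """
    SELECT repo_name, license
    FROM `bigquery-public-data.github_repos.licenses`
    LIMIT 10
"""
df = client.query(query).to_dataframe()  # requires pandas (and db-dtypes) installed
print(df)
```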
Since we could only make a limited number of requests from Kaggle's kernels and couldn't afford to pay for GCP credits, we opted to work directly in BigQuery and download a sample from the following tables:
SELECT * FROM `bigquery-public-data.github_repos.commits` LIMIT 10
SELECT * FROM `bigquery-public-data.github_repos.licenses` LIMIT 10
SELECT * FROM `bigquery-public-data.github_repos.sample_commits` LIMIT 10
SELECT * FROM `bigquery-public-data.github_repos.sample_repos` LIMIT 10
Note: We limit the queries to 10 rows here for illustration. The actual extraction pulled over 200,000 records, which were then sampled down to about 15,000 records ready for use.
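For completeness, here is a hedged sketch of how such a sample could be pulled and cached locally with the same client; the table names come from the queries above, while the row limit and file names are only illustrative:

```python
# Sketch: download a larger sample of each table and cache it locally as CSV.
# Note that BigQuery bills for the columns scanned even when LIMIT is used.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")  # placeholder project id

tables = ["commits", "licenses", "sample_commits", "sample_repos"]
for table in tables:
    query = f"""
        SELECT *
        FROM `bigquery-public-data.github_repos.{table}`
        LIMIT 200000
    """
    df = client.query(query).to_dataframe()
    df.to_csv(f"{table}_sample.csv", index=False)
    print(f"{table}: {len(df)} rows saved")
```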
After downloading the dataset, we applied some joins and dropped columns in generate_dataset.ipynb. We save the resulting dataframe to a CSV file to import into our GitSoft Engine project.
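A minimal sketch of what that preprocessing could look like, reusing the CSVs written in the previous sketch; the column names (repo_name, license, watch_count) come from the public table schemas, but the exact joins and dropped columns in generate_dataset.ipynb may differ:

```python
# Sketch of the join / column-drop / export step done in generate_dataset.ipynb.
# Column names are assumptions based on the public table schemas; the real
# notebook may join and drop differently.
import pandas as pd

licenses = pd.read_csv("licenses_sample.csv")          # repo_name, license
sample_repos = pd.read_csv("sample_repos_sample.csv")  # repo_name, watch_count

# Join the tables on the repository name.
merged = sample_repos.merge(licenses, on="repo_name", how="inner")

# Drop columns that are not needed downstream (illustrative only).
merged = merged.drop(columns=["watch_count"], errors="ignore")

# Downsample to roughly 15,000 rows, as described above.
final = merged.sample(n=min(15000, len(merged)), random_state=42)

# Save the dataframe for the GitSoft Engine project.
final.to_csv("gitsoft_engine_dataset.csv", index=False)
```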
Here are some charts of the kind of data we'll be working with (taken from a SAMPLE of the whole dataset):
Follow the main repo GitSoft-Engine