In this notebook we explain how we connect to the Google Cloud BigQuery service to access its public datasets. We use the github_repos dataset, which contains data from over 3 million GitHub repositories.
Finally, we extract a sample of the BigQuery dataset and preprocess it.
You can find more information about the dataset on Kaggle and Google Cloud BigQuery.
The dataset originates from the Google Cloud BigQuery public data program.
Kaggle has an example notebook showing data extraction through SohierDane's Python library BigQueryHelper, which simplifies access to BigQuery, although you still need ADC credentials from the Google Cloud CLI.
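As a rough sketch of how that library is used (it assumes ADC credentials are already configured; the table and column names match the queries later in this notebook, and the 1 GB scan cap is an arbitrary safety value):

```python
# Sketch of data extraction with SohierDane's bq_helper (pip install bq-helper).
# Assumes ADC credentials are already set up on the machine.
from bq_helper import BigQueryHelper

# Point the helper at the public github_repos dataset.
github = BigQueryHelper(active_project="bigquery-public-data",
                        dataset_name="github_repos")

print(github.list_tables())  # e.g. commits, licenses, sample_commits, sample_repos

query = """
    SELECT repo_name, license
    FROM `bigquery-public-data.github_repos.licenses`
    LIMIT 10
"""
print(github.estimate_query_size(query))                    # estimated GB scanned
df = github.query_to_pandas_safe(query, max_gb_scanned=1)   # aborts if scan exceeds 1 GB
print(df.head())
```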
Download the Google Cloud CLI and create credentials to use BigQuery directly from Python with the Google Cloud Python SDK.
After setting up the credentials, you'll need to configure the Google Cloud client SDK and use an IAM account with a sufficient quota of bytes/requests for the API.
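A minimal sketch of that setup with the official google-cloud-bigquery client, assuming ADC credentials were created with the gcloud CLI and that `your-gcp-project-id` is a placeholder for your own billing project:

```python
# Minimal sketch using the Google Cloud Python SDK (pip install google-cloud-bigquery).
# Credentials are picked up from ADC, created beforehand with:
#   gcloud auth application-default login
from google.cloud import bigquery

# The project given here is the one billed and quota-tracked for the queries,
# not the project that owns the public dataset.
client = bigquery.Client(project="your-gcp-project-id")  # placeholder project id

query = """
    SELECT repo_name, license
    FROM `bigquery-public-data.github_repos.licenses`
    LIMIT 10
"""
df = client.query(query).to_dataframe()  # requires pandas (and db-dtypes) installed
print(df)
```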
Since we could only make a limited number of requests from Kaggle's kernels and couldn't afford to pay for GCP credits, we opted to work directly in BigQuery and download a sample from the following tables:
SELECT * FROM `bigquery-public-data.github_repos.commits` LIMIT 10
SELECT * FROM `bigquery-public-data.github_repos.licenses` LIMIT 10
SELECT * FROM `bigquery-public-data.github_repos.sample_commits` LIMIT 10
SELECT * FROM `bigquery-public-data.github_repos.sample_repos` LIMIT 10
Note: We limit the queries to 10 rows here for illustration. The actual extraction pulled over 200,000 records, which were then sampled down to about 15,000 records ready for use.
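For completeness, here is a hedged sketch of how such a sample could be pulled and cached locally with the same client; the table names come from the queries above, while the row limit and file names are only illustrative:

```python
# Sketch: download a larger sample of each table and cache it locally as CSV.
# Note that BigQuery bills for the columns scanned even when LIMIT is used.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")  # placeholder project id

tables = ["commits", "licenses", "sample_commits", "sample_repos"]
for table in tables:
    query = f"""
        SELECT *
        FROM `bigquery-public-data.github_repos.{table}`
        LIMIT 200000
    """
    df = client.query(query).to_dataframe()
    df.to_csv(f"{table}_sample.csv", index=False)
    print(f"{table}: {len(df)} rows saved")
```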
After downloading the dataset, we applied some joins and dropped columns in generate_dataset.ipynb. We save the resulting dataframe to a CSV file to import into our GitSoft Engine project.
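A minimal sketch of what that preprocessing could look like, reusing the CSVs written in the previous sketch; the column names (repo_name, license, watch_count) come from the public table schemas, but the exact joins and dropped columns in generate_dataset.ipynb may differ:

```python
# Sketch of the join / column-drop / export step done in generate_dataset.ipynb.
# Column names are assumptions based on the public table schemas; the real
# notebook may join and drop differently.
import pandas as pd

licenses = pd.read_csv("licenses_sample.csv")          # repo_name, license
sample_repos = pd.read_csv("sample_repos_sample.csv")  # repo_name, watch_count

# Join the tables on the repository name.
merged = sample_repos.merge(licenses, on="repo_name", how="inner")

# Drop columns that are not needed downstream (illustrative only).
merged = merged.drop(columns=["watch_count"], errors="ignore")

# Downsample to roughly 15,000 rows, as described above.
final = merged.sample(n=min(15000, len(merged)), random_state=42)

# Save the dataframe for the GitSoft Engine project.
final.to_csv("gitsoft_engine_dataset.csv", index=False)
```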
Here are some charts of the kind of data we'll be working with (taken from a SAMPLE of the whole dataset):
Follow the main repo GitSoft-Engine