Polishing code base, testing for scenarios.
stefanbringuier committed Apr 13, 2022
1 parent 490d905 commit 78b9ebf
Showing 11 changed files with 142 additions and 80 deletions.
Empty file added Artifacts.toml
Empty file.
41 changes: 36 additions & 5 deletions README.md
@@ -1,4 +1,4 @@
# CBFV.jl : A simple composition-based feature vectorization Julia utility
# CBFV.jl : A simple composition-based feature vectorization utility in Julia
[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://juliamatsci.github.io/CBFV.jl/stable) [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://juliamatsci.github.io/CBFV.jl/dev) [![Build Status](https://github.com/juliamatsci/CBFV.jl/workflows/CI/badge.svg)](https://github.com/JuliaMatSci/CBFV.jl/actions) [![Build Status](https://travis-ci.com/JuliaMatSci/CBFV.jl.svg?branch=master)](https://travis-ci.com/JuliaMatSci/CBFV.jl) [![Coverage](https://codecov.io/gh/JuliaMatSci/CBFV.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/JuliaMatSci/CBFV.jl)

This is a Julia rewrite of the [python tool](https://github.com/kaaiian/CBFV) to create a composition-based feature vector representation for machine learning with materials science data. The ideas and methodology are discussed in the recent article:
@@ -8,11 +8,42 @@ This is a Julia rewrite of the [python tool](https://github.com/kaaiian/CBFV) to
and the original python source code(s) can be found here:

> https://github.com/anthony-wang/BestPractices/tree/master/notebooks/CBFV
- [https://github.com/anthony-wang/BestPractices/tree/master/notebooks/CBFV](https://github.com/anthony-wang/BestPractices/tree/master/notebooks/CBFV)
- [https://github.com/kaaiian/CBFV](https://github.com/kaaiian/CBFV)

> https://github.com/kaaiian/CBFV
## Example Use

## Citation
The input data set should have at least two columns named `formula` and `target`.

```@example
using DataFrames
using CBFV
data = DataFrame("name"=>["Rb2Te","CdCl2","LaN"],"bandgap_eV"=>[1.88,3.51,1.12])
rename!(data,Dict("name"=>"formula","bandgap_eV"=>"target"))
features = generatefeatures(data)
```

## Supported Featurization Schemes

As with the original CBFV Python package, the following element databases are available:

- `oliynyk` (default): Database from A. Oliynyk.
- `magpie`: [Materials Agnostic Platform for Informatics and Exploration](https://bitbucket.org/wolverton/magpie/src/master/)
- `mat2vec`: [Word embeddings capture latent knowledge from materials science](https://github.com/materialsintelligence/mat2vec)
- `jarvis`: [Joint Automated Repository for Various Integrated Simulations, provided by the U.S. National Institute of Standards and Technology.](https://jarvis.nist.gov/)
- `onehot`: Simple one hot encoding scheme, i.e., diagonal elemental matrix.
- `random_200`: 200 random elemental properties (I'm assuming).

However, `CBFV.jl` will allow you to provide your own element database to featurize with. The current implementation reads the saved `.csv` files in [`databases`](@ref), which is prone to issues (e.g., out-of-date files). To alleviate this, I will change the implementation to use `Pkg.Artifacts` with an `Artifacts.toml` file, which enables fetching the required data files from a server if they do not already exist locally.

### Julia Dependencies
This is a relatively small package so there aren't a lot of dependencies. The required packages are:

- CSV
- DataFrames
- ProgressBars

## Citations
Please cite the following if you use this package in your work:

```bibtex
@@ -46,4 +77,4 @@ In addition, please also consider citing the original python implementation and
journal = {Chemistry of Materials},
doi = {10.1021/acs.chemmater.0c01907}
}
```
```
6 changes: 0 additions & 6 deletions TODO.md

This file was deleted.

6 changes: 3 additions & 3 deletions docs/make.jl
@@ -4,11 +4,11 @@ using Documenter
makedocs(;
modules=[CBFV],
authors="Stefan Bringuier <stefanbringuier@gmail.com> and contributors",
repo="https://github.com/stefanbringuier/CBFV.jl/blob/{commit}{path}#L{line}",
repo="https://github.com/JuliaMatSci/CBFV.jl/blob/{commit}{path}#L{line}",
sitename="CBFV.jl",
format=Documenter.HTML(;
prettyurls=get(ENV, "CI", "false") == "true",
canonical="https://stefanbringuier.github.io/CBFV.jl",
canonical="https://JuliaMatSci.github.io/CBFV.jl",
assets=String[],
),
pages=[
@@ -20,5 +20,5 @@ makedocs(;
Modules = [CBFV]

deploydocs(;
repo="github.com/stefanbringuier/CBFV.jl",
repo="github.com/JuliaMatSci/CBFV.jl",
)
10 changes: 10 additions & 0 deletions docs/src/examples.md
@@ -16,4 +16,14 @@ using DataFrames #hide
using CBFV #hide
d = DataFrame(:formula=>["Tc1V1","Cu1Dy1","Cd3N2"],:target=>[248.539,66.8444,91.5034]) #hide
generatefeatures(d,elementdata="jarvis")
```

Another example:

```@example
using DataFrames
using CBFV
data = DataFrame("name"=>["Rb2Te","CdCl2","LaN"],"bandgap_eV"=>[1.88,3.51,1.12])
rename!(data,Dict("name"=>"formula","bandgap_eV"=>"target"))
features = generatefeatures(data)
```
16 changes: 14 additions & 2 deletions docs/src/index.md
@@ -1,5 +1,5 @@
# CBFV.jl : A simple composition-based feature vectorization utility in Julia
[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://stefanbringuier.github.io/CBFV.jl/stable) [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://stefanbringuier.github.io/CBFV.jl/dev) [![Build Status](https://github.com/stefanbringuier/CBFV.jl/workflows/CI/badge.svg)](https://github.com/stefanbringuier/CBFV.jl/actions) [![Build Status](https://travis-ci.com/stefanbringuier/CBFV.jl.svg?branch=master)](https://travis-ci.com/stefanbringuier/CBFV.jl) [![Coverage](https://codecov.io/gh/stefanbringuier/CBFV.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/stefanbringuier/CBFV.jl)
[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://juliamatsci.github.io/CBFV.jl/stable) [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://juliamatsci.github.io/CBFV.jl/dev) [![Build Status](https://github.com/juliamatsci/CBFV.jl/workflows/CI/badge.svg)](https://github.com/JuliaMatSci/CBFV.jl/actions) [![Build Status](https://travis-ci.com/JuliaMatSci/CBFV.jl.svg?branch=master)](https://travis-ci.com/JuliaMatSci/CBFV.jl) [![Coverage](https://codecov.io/gh/JuliaMatSci/CBFV.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/JuliaMatSci/CBFV.jl)

This is a Julia rewrite of the [python tool](https://github.com/kaaiian/CBFV) to create a composition-based feature vector representation for machine learning with materials science data. The ideas and methodology are discussed in the recent article:

@@ -11,6 +11,18 @@ and the original python source code(s) can be found here:
- [https://github.com/anthony-wang/BestPractices/tree/master/notebooks/CBFV](https://github.com/anthony-wang/BestPractices/tree/master/notebooks/CBFV)
- [https://github.com/kaaiian/CBFV](https://github.com/kaaiian/CBFV)

## Example Use

The input data set should have at least two columns named `formula` and `target`.

```@example
using DataFrames
using CBFV
data = DataFrame("name"=>["Rb2Te","CdCl2","LaN"],"bandgap_eV"=>[1.88,3.51,1.12])
rename!(data,Dict("name"=>"formula","bandgap_eV"=>"target"))
features = generatefeatures(data)
```

## Supported Featurization Schemes

As with the original CBFV Python package, the following element databases are available:
@@ -24,7 +36,7 @@ As with the original CBFV python package the following element databases are avai

However, `CBFV.jl` will allow you to provide your own element database to featurize with. The current implementation reads the saved `.csv` files in [`databases`](@ref), which is prone to issues (e.g., out-of-date files). To alleviate this, I will change the implementation to use `Pkg.Artifacts` with an `Artifacts.toml` file, which enables fetching the required data files from a server if they do not already exist locally.

## Julia Dependencies
### Julia Dependencies
This is a relatively small package so there aren't a lot of dependencies. The required packages are:

- CSV
4 changes: 2 additions & 2 deletions src/Errors.jl
@@ -8,8 +8,8 @@ function checkdataframe(dataframe::DataFrame)
end

function checkcombineallowed(dataframe::DataFrame)
extrprops = dataframe[!,Not([:formula,:target])]
if !isempty(extrprops)
#extrprops = dataframe[!,Not([:formula,:target])]
if !isempty(dataframe)
return true
else
@warn "Combining of features requested but none exist, skipping!"
66 changes: 36 additions & 30 deletions src/Featurization.jl
@@ -1,7 +1,7 @@
# see LICENSE

""" combinefeatures!(features,extras)
combinefeatures!(features,featnames,extras)
""" combinefeatures(features,extras)
combinefeatures(features,featnames,extras)
Combines existing features in data with the prepared features. Returns additional
@@ -12,27 +12,32 @@ vector of column names for a database.
- `extras::DataFrame`: The data frame representation of the original data.
- `featnames::Vector`: The column names of the generated features.
# Returns (Optional)
# Returns
- `newfeatures::AbstractArray`: Combined features
- `combfeatnames::Vector{String}`: Combined names of feature columns.
"""
function combinefeatures!(features::AbstractArray, extras::DataFrame)
function combinefeatures(features::AbstractArray, extras::DataFrame)
if checkcombineallowed(extras)
extrasarry = Tables.matrix(extras)
features = hcat(features, extrasarry)
newfeatures = hcat(features, extrasarry)
else
newfeatures = features
end
return newfeatures
end # function combinefeatures

combinefeatures!(features::AbstractArray, featnames::Vector, extras::DataFrame) = begin
combinefeatures(features::AbstractArray, featnames::Vector, extras::DataFrame) = begin

if checkcombineallowed(extras)
extrasarry = Tables.matrix(extras)
features = hcat(features, extrasarry)
newfeatures = hcat(features, extrasarry)
combfeatnames = vcat(featnames, names(extras))
else
newfeatures = features
combfeatnames = featnames
end
return combfeatnames
return newfeatures,combfeatnames
end
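
The non-mutating `combinefeatures` amounts to a horizontal concatenation of the feature matrix with the extra columns, plus a matching concatenation of the column names. A minimal sketch with made-up data follows; the values and column names are illustrative assumptions, and `Matrix(extras)` stands in for `Tables.matrix(extras)` so that only DataFrames is needed.

```julia
using DataFrames

# Illustrative inputs (assumed): 3 formulae x 2 generated features.
features = [1.0 2.0; 3.0 4.0; 5.0 6.0]
featnames = ["avg_a", "avg_b"]

# One extra column carried over from the input data, one row per formula.
extras = DataFrame(temperature=[300.0, 400.0, 500.0])

# Append the extra columns to the right of the generated features.
newfeatures = hcat(features, Matrix(extras))

# Keep the names aligned with the columns of `newfeatures`.
combfeatnames = vcat(featnames, names(extras))
```

Because rows are formulae and columns are features, the column for name `combfeatnames[i]` is `newfeatures[:, i]`.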

"""
@@ -51,12 +56,12 @@ function assignfeatures(processeddata::Vector{Dict{Symbol,Any}},
iterformulae = ProgressBar(1:length(formulae))
skippedformula = Array{String,1}()

features = Vector{Matrix{Number}}(undef, length(formulae))
features = Vector{Matrix{Float64}}(undef, length(formulae))

for i in iterformulae
Threads.@threads for i in iterformulae
formula = formulae[i]
amount = processeddata[i][:amount]
properties = processeddata[i][:eleprops]
amount = processeddata[i][:amount]::Vector{Float64}
properties = processeddata[i][:eleprops]::Matrix{Float64}

# Each formula has a n-element by m-feature matrix representation.
# Construct all the feature vectors
@@ -95,15 +100,20 @@ column name prefixes are fixed based on the CBFV approach which is to use the fo
moments from the element features in the formula.
# Arguments
- `featcolnames::Vector{String}` : The name of the columns for the feature vectors
- `features::Array{Float64,2}` : The feature vectors
- `extrafeatures::Tuple{Bool,DataFrame}` : These are the features carried from the input data
- `sumfeatures::Bool` : whether or not to add the sum statistics feature vector
# Returns
- `DataFrame` : the dataframe for the features
"""
function constructfeaturedataframe(featcolnames::Vector{String},
features::Array{Number,2},
features::Array{Float64,2},
extrafeatures::Tuple{Bool,DataFrame},
sumfeatures::Bool)

if sumfeatures
colprefixes = ["sum_", "avg_", "dev_", "range_", "max_", "min_", "mode_"]
else
@@ -119,10 +129,11 @@ function constructfeaturedataframe(featcolnames::Vector{String},
dictfeatnames = Dict{String,Vector}()

if extrafeatures[1]
combinedfeatnames = combinefeatures!(features, featnames, extrafeatures[2])
for (i, n) in enumerate(combinedfeatnames)
dictfeatnames[n] = features[i, :]
end
#combfeatures,combinedfeatnames = combinefeatures(features, featnames, extrafeatures[2])
#for (i, n) in enumerate(combinedfeatnames)
# dictfeatnames[n] = combfeatures[:,i]
#end
@info "The combine feature is not correctly implemented and is being skipped!"
else
for (i, n) in enumerate(featnames)
dictfeatnames[n] = features[:, i]
@@ -173,14 +184,6 @@ using CBFV
d = DataFrame(:formula=>["Tc1V1","Cu1Dy1","Cd3N2"],:target=>[248.539,66.8444,91.5034])
generatefeatures(d)
```
# TODOs
- Add dropduplicate Optional
- Decide what to do with `skippedformulas`
- Process element data features with `NaN`
"""
function generatefeatures(data::DataFrame;
elementdata::String="oliynyk",
@@ -189,20 +192,22 @@ function generatefeatures(data::DataFrame;
sumfeatures=false,
returndataframe=true)


# Process input data
checkdataframe(data)
formulae = data[!, :formula]
featcolnames, processeddata = processinputdata(data, elementdata)
featcolnames, processeddata = processinputdata(dropduplicate ? unique(data) : data, elementdata)

targets = [row[:target] for row in processeddata]

# Featurization
features, skippedformulas = assignfeatures(processeddata,
formulae,
sumfeatures)
extrafeatures = data[!, Not([:formula, :target])]

#TODO: need to fill features that are NaN with median values.
# Extra features from original data
extra_df = data[!, Not([:formula, :target])]
extrafeatures = dropduplicate && !isempty(extra_df) ? unique(extra_df) : extra_df

if returndataframe
generatedataframe = constructfeaturedataframe(featcolnames, features, (combine, extrafeatures), sumfeatures)
@@ -211,7 +216,8 @@ function generatefeatures(data::DataFrame;
return generatedataframe
else
if combine
combinefeatures!(features, extrafeatures)
#combinefeatures(features, extrafeatures)
@info "The combine feature is not correctly implemented and is being skipped!"
end
return formulae, features, targets
end
48 changes: 36 additions & 12 deletions src/ProcessData.jl
@@ -1,5 +1,24 @@
# see LICENSE

"""
replacemissing!(data)
Replace missing values in the columns of a data frame with the mean value of that column.
# Arguments
- `data::DataFrame`: data frame with missing values.
# Modifies
- `data::DataFrame`: changes `missing` values to the mean value of the column.
""" function replacemissing!(data::DataFrame)

columnnames = names(data[!, Not(:element)])
statdata = describe(data);
for (i,n) in enumerate(columnnames)
data[!,n] = coalesce.(data[!,n],statdata[i,:mean])
end
end
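
Mean imputation can also be keyed by column name instead of by row position in the `describe` table, which avoids any offset introduced by the `element` column. The sketch below is one possible approach, not the package's implementation: `fillmissingwithmean!` is a hypothetical helper, the small `db` table is made up, and every column other than `element` is assumed to be numeric.

```julia
using DataFrames
using Statistics

# Hypothetical helper (not part of CBFV.jl): fill `missing` entries in every
# column except `skip` with that column's mean, looked up by column name.
function fillmissingwithmean!(data::DataFrame; skip::Symbol=:element)
    for name in names(data, Not(skip))
        column = data[!, name]
        present = collect(skipmissing(column))
        # Leave a column untouched when it has no observed values to average.
        isempty(present) && continue
        data[!, name] = coalesce.(column, mean(present))
    end
    return data
end

# Made-up element table with one gap per property column.
db = DataFrame(element=["H", "He", "Li"],
               mass=[1.008, missing, 6.94],
               radius=[missing, 31.0, 128.0])
fillmissingwithmean!(db)
```

After the call, each gap holds the mean of the observed values in its own column, and the `element` column is left as it was.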

"""
removeunsupported!(datainput,elementproperties)
@@ -17,7 +36,7 @@ Handle cases where compound can't be processed because it isn't an allowed eleme
function removeunsupported!(datainput::DataFrame)

formulas = copy(datainput[!, :formula])
rows = range(1, nrow(datainput))
rows = range(1, nrow(datainput),step=1)
splitformulas = keys.(getrepresentation.(formulas))

for i = 1:length(formulas)
@@ -47,12 +66,13 @@ returns an array of properties for elements that are in a formula.
"""
function extractproperties(elements::Vector{String},
properties::AbstractArray,
formulaelements::Array{String,1},
formula::String)
properties::Array{T,2},
formulaelements::Array{String,1},
formula::String) where T<:Number

_, m = size(properties)
l = length(formulaelements)
extractedproperties = Array{Number,2}(undef, l, m)
extractedproperties = Array{Float64,2}(undef, l, m)

for (i, e) in enumerate(formulaelements)
if stripamt(e) ∈ elements
@@ -106,10 +126,12 @@ of the entire database.
- `arrayrepresentation::Array{Any,2}`: representation of the dataframe
"""
function processelementdatabase(data::DataFrame)

columnnames = names(data[!, Not(:element)])
function processelementdatabase(data::DataFrame;replacemissing=true)

if replacemissing
replacemissing!(data)
end

elementsymbols = convert(Vector{String}, data[!, :element])
elementindex = collect(1:nrow(data))
elementmissing = collect(setdiff(
@@ -122,17 +144,19 @@ function processelementdatabase(data::DataFrame)

arrayrepresentation = Tables.matrix(data[!, Not(:element)])

columnnames = names(data[!, Not(:element)])

return elementinfo, columnnames, arrayrepresentation
end # function processelementdatabase

processelementdatabase(databasename::String) = begin
processelementdatabase(databasename::String;kwargs...) = begin
data = getelementpropertydatabase(databasename)
processelementdatabase(data)
processelementdatabase(data;kwargs...)
end

processelementdatabase(databasepath::FileName) = begin
processelementdatabase(databasepath::FileName;kwargs...) = begin
data = readdatabasefile(databasepath.fullpath)
processelementdatabase(data)
processelementdatabase(data;kwargs...)
end
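
In Julia, keyword arguments are forwarded with a semicolon (`f(x; kwargs...)`). Splatting them after a comma passes each one as a positional `Pair` instead. A small illustration with hypothetical functions (`inner`, `forward_keywords`, and `forward_positional` are not part of the package):

```julia
# A function with one positional and one keyword argument.
inner(x; scale=1) = x * scale

# Forwards keyword arguments as keywords.
forward_keywords(x; kwargs...) = inner(x; kwargs...)

# Splats the keyword arguments as positional `Pair`s instead.
forward_positional(x; kwargs...) = inner(x, kwargs...)

ok = forward_keywords(2; scale=3)

# With no keywords there is nothing to splat, so the call still succeeds.
unchanged = forward_positional(2)

# With a keyword, `inner(2, :scale => 3)` has no matching method.
failed = try
    forward_positional(2; scale=3)
    false
catch err
    err isa MethodError
end
```

The positional form works only while no keyword is passed, so the problem stays hidden until a caller sets an option such as `replacemissing=false`.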

