Building the dependency graph #3

bmcfee · 2014-10-11T17:39:47Z

Most research software does not actually get cited directly. For example, a paper might cite sklearn but not numpy, or numpy but not BLAS, etc. Consequently, most research software is only cited implicitly.

To try and fill in the implied citation network, we can extract software dependencies from known repositories. This can take a few forms:

Python packages that use setuptools define their dependencies explicitly, and these are stored in a well-structured object that's easy to parse.
What about R?
What about MATLAB?
What about C/C++?
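
For the Python case, a minimal sketch of reading declared dependencies might look like the following. It assumes the package is installed locally and uses the standard-library `importlib.metadata` module (Python 3.9+ as written) rather than parsing `setup.py` directly; the helper names `requirement_name` and `direct_dependencies` are hypothetical, not part of any existing tool.

```python
import re
from importlib import metadata

# A requirement string begins with the project name, followed by optional
# extras, version specifiers, and environment markers.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_name(requirement: str) -> str:
    """Return just the project name from a requirement string."""
    match = _NAME_RE.match(requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return match.group(1)


def direct_dependencies(dist_name: str) -> set[str]:
    """Names of the declared dependencies of an installed distribution.

    Returns an empty set if the distribution is not installed. Optional
    ("extra") dependencies are included, since they are declared too.
    """
    try:
        requirements = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return set()
    return {requirement_name(r) for r in requirements}
```

Calling `direct_dependencies("some-package")` only sees what the installed metadata declares, so anything a package imports without declaring it is still missed.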

Alternatively, once we have a list of top-level packages, we can start crawling package management hierarchies:

Debian/Ubuntu/etc.
PyPI
MathWorks File Exchange?
What about Mac users: Anaconda? Homebrew? MacPorts?
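
Whichever package index is used, the crawl itself could be a plain breadth-first traversal. The sketch below assumes a caller-supplied `lookup` function that returns the direct dependencies of one package (for example, a wrapper around a PyPI or Debian query); both `crawl_dependencies` and `lookup` are hypothetical names, and the toy index in the usage note is made up for illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Set


def crawl_dependencies(
    roots: Iterable[str],
    lookup: Callable[[str], Iterable[str]],
) -> Dict[str, Set[str]]:
    """Breadth-first crawl from the top-level packages in `roots`.

    Returns a mapping from each reachable package to its direct
    dependencies. Each package is looked up once, so cycles terminate.
    """
    graph: Dict[str, Set[str]] = {}
    queue = deque(roots)
    while queue:
        pkg = queue.popleft()
        if pkg in graph:
            continue  # already visited
        deps = set(lookup(pkg))
        graph[pkg] = deps
        # Sorted only to make the traversal order deterministic.
        queue.extend(d for d in sorted(deps) if d not in graph)
    return graph
```

With a toy index such as `{"sklearn": ["numpy", "scipy"], "scipy": ["numpy"], "numpy": []}`, passing `lambda p: index.get(p, [])` as `lookup` yields one entry per package, including the implicit `numpy` dependency of a paper that cites only `sklearn`.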

Once we have a full tree, we'll have to prune it back to some reasonable level. It might be useful to include something like boost, but libc would obviously be a step too far. Where do we draw the line? Can this be automated?
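
One possible way to make the pruning rule explicit, sketched under the assumption that the cut-off is a hand-curated stoplist plus an optional depth limit (neither of which answers the question of where the line should be, only how to apply it once chosen). The function name `prune` and its parameters are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set


def prune(
    graph: Dict[str, Set[str]],
    roots: Iterable[str],
    stoplist: Iterable[str] = (),
    max_depth: Optional[int] = None,
) -> Dict[str, Set[str]]:
    """Keep only packages reachable from `roots` without passing through
    a stoplisted package, and stop expanding at `max_depth` hops."""
    blocked = set(stoplist)
    pruned: Dict[str, Set[str]] = {}
    queue = deque((r, 0) for r in roots if r not in blocked)
    while queue:
        pkg, depth = queue.popleft()
        if pkg in pruned:
            continue  # first visit was at the shortest depth (BFS)
        if max_depth is not None and depth >= max_depth:
            pruned[pkg] = set()  # keep the node, drop what lies beneath it
            continue
        kept = {d for d in graph.get(pkg, ()) if d not in blocked}
        pruned[pkg] = kept
        queue.extend((d, depth + 1) for d in sorted(kept))
    return pruned
```

For example, with `stoplist={"libc"}` a graph in which `boost` depends on `libc` keeps `boost` but drops `libc` and any edge pointing to it.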

sbenthall · 2014-10-28T17:56:43Z

Can I request that this dependency tracking be implemented in such a way that it can be imported as a module into another project?

I ask because I've been intending to do something similar to this for a collaboration analysis tool my team has been working on:
https://github.com/sbenthall/bigbang

One thing I'd like to suggest (though it might be scope creep) is to think about how this integrates with version control. Software dependencies are something that change over time.

sbenthall · 2014-10-31T22:01:57Z

In the interest of reducing redundant effort, here is a pointer to the related feature request in BigBang:

https://github.com/sbenthall/bigbang/issues/109

You might be interested in MetricsGrimoire, which has a project, CVSAnalY, for importing version control data:

http://metricsgrimoire.github.io/

bmcfee · 2014-11-11T22:02:22Z

Yes, that's an excellent point. For something like PyPI or Debian, dynamic dependency tracking would be pretty straightforward since all packages are versioned. For the other, more esoteric sources (MathWorks?), this seems pretty treacherous, but maybe solvable via timestamps.

I definitely like the idea of implementing that as a standalone module. I worry a little about having common identifiers across modules if it gets split up, but canonical naming can be part of the functionality of the dependency tracking module.
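
As a rough illustration of what canonical naming inside the module could look like, the sketch below prefixes each name with its ecosystem and normalizes the name itself. The `ecosystem:name` scheme and the function `canonical_id` are assumptions made up for this example; the normalization step follows the rule PyPI uses for project names (PEP 503), which may or may not suit other sources.

```python
import re


def canonical_id(ecosystem: str, name: str) -> str:
    """Build a stable identifier such as 'pypi:scikit-learn'.

    Runs of '-', '_' and '.' collapse to a single '-', and everything is
    lowercased, so spelling variants of one package map to one key.
    """
    normalized = re.sub(r"[-_.]+", "-", name.strip()).lower()
    return f"{ecosystem.strip().lower()}:{normalized}"
```

Keeping the ecosystem in the identifier means a Debian package and a PyPI package with the same name stay distinct unless something explicitly links them.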

sbenthall mentioned this issue Oct 31, 2014

build dependency graph over time from a collection of repositories datactive/bigbang#109

Closed
