JSON output #73

ross-spencer opened this issue Jan 18, 2022 · 0 comments
Labels
enhancement a feature by any other name

ross-spencer commented Jan 18, 2022

Using the following pattern starts to get us into the realm of a decent JSON output.

import json
print(json.dumps(analysis_results.__dict__, sort_keys=True, indent=2))
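
One thing to watch with this pattern: `json.dumps` raises `TypeError` if any member variable is not a built-in JSON type. Below is a minimal sketch of a fallback using the `default=` parameter. `AnalysisResults` is a hypothetical stand-in for the real object, and the `extensions` field is invented purely to trigger the fallback.

```python
import json


class AnalysisResults:
    """Hypothetical stand-in for the real analysis results object."""

    def __init__(self):
        self.filecount = 324
        self.hashused = True
        # Invented field: a set is not JSON-serializable by default.
        self.extensions = {"pdf"}


def to_json(results):
    """Serialize an object's member variables, with a fallback for unsupported types."""

    def fallback(value):
        # Sets become sorted lists so the output is stable between runs.
        if isinstance(value, (set, frozenset)):
            return sorted(value)
        # Last resort: use the string representation.
        return str(value)

    return json.dumps(results.__dict__, sort_keys=True, indent=2, default=fallback)


print(to_json(AnalysisResults()))
```

Whether the real object needs this depends on what its members contain; if everything is already a list, dict, string, number, boolean or None, the plain call works unchanged.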

The output below shows roughly all the top-level fields. There are at least two problems with it:

  1. Insensitive naming in the context of a collection. Bad File Names represents neither the end-user's collection nor the intention of the code. We want to express something more like "names that need more care", e.g. we've all seen ���� when we don't want to. This is more about making sure data is preserved in end-to-end workflows.

NB. In the output report, this field is Identifying Non-ASCII and System File Names. The naming comes from the member variable within the analysis results object.

  2. Naming conventions. The naming conventions are all over the place. Who knew PEP8 was a thing even in 2014?! (jk) At least snake case these. Golang JSON naming should also be considered: in Go, capitalized fields are exported and can be read implicitly into code, so bof_distance here, which is correct, might become BOFDistance, and collectionsize becomes CollectionSize. I don't know if these names can be aliased somehow, given that .__dict__ outputs member variables as-is.
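
One possible way to alias the names without renaming the member variables is to build the output dict through a mapping step instead of dumping `.__dict__` directly. The sketch below is an assumption, not existing code: `ALIASES`, `snake_case`, `aliased_dict` and the replacement names are all hypothetical. Note that all-lowercase compounds such as collectionsize, and mixed runs such as duplicateHASHlisting, cannot be split mechanically, so they need an explicit alias.

```python
import json
import re

# Hypothetical explicit aliases: names that should change wording, or that
# cannot be split into words mechanically.
ALIASES = {
    "badFileNames": "file_names_needing_care",
    "badDirNames": "directory_names_needing_care",
    "collectionsize": "collection_size",
    "duplicateHASHlisting": "duplicate_hash_listing",
}


def snake_case(name):
    """Convert camelCase to snake_case; names without capitals pass through unchanged."""
    # Insert an underscore between a lowercase letter or digit and an uppercase letter.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split a run of capitals from a following capitalized word, e.g. "XMLIdentifiers".
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    return name.lower()


def aliased_dict(obj):
    """Return the object's member variables keyed by their output names."""
    return {ALIASES.get(key, snake_case(key)): value for key, value in vars(obj).items()}


class AnalysisResults:
    """Hypothetical stand-in for the real analysis results object."""

    def __init__(self):
        self.badFileNames = []
        self.bof_distance = []
        self.collectionsize = 397567751
        self.directoryCount = 51
        self.distinctXMLIdentifiers = 0


print(json.dumps(aliased_dict(AnalysisResults()), sort_keys=True, indent=2))
```

The same mapping table could hold Go-style names (BOFDistance, CollectionSize) instead, if that convention is preferred.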

There's a lot of data output, but JSON tools might be able to use this sensibly. I should consider documenting examples.
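
As a starting point for such documentation, here is a rough sketch of consuming the report with Python's standard `json` module. The sample is abbreviated from the example output in this issue; the cross-check itself is only an illustration of what a downstream tool might do.

```python
import json

# Abbreviated sample using field names and values from the example output.
report = """
{
  "filecount": 324,
  "identifiedfilecount": 271,
  "unidentifiedfilecount": 53,
  "zerobytecount": 28,
  "identifiedPercentage": "83.6"
}
"""

data = json.loads(report)

# Percentages are emitted as strings, so convert before doing arithmetic.
identified = float(data["identifiedPercentage"])

# Cross-check the reported percentage against the raw counts.
computed = round(100 * data["identifiedfilecount"] / data["filecount"], 1)

print(f"identified: {identified}% (computed {computed}%)")
print(f"zero-byte files: {data['zerobytecount']} of {data['filecount']}")
```

The string-typed percentages are worth noting here: emitting them as numbers would save consumers the conversion step.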

Example output:

  "badDirNames": [], 
  "badFileNames": [
  "binaryidentifiers": [
  "bof_distance": [
  "collectionsize": 397567751, 
  "containercount": 13, 
  "containertypeslist": [
  "dateFrequency": [
  "denylist": null, 
  "denylist_directories": [], 
  "denylist_exts": [], 
  "denylist_filenames": [], 
  "denylist_ids": [], 
  "directoryCount": 51, 
  "distinctFilenameIdentifiers": 1, 
  "distinctOtherIdentifiers": 31, 
  "distinctSignaturePuidcount": 51, 
  "distinctTextIdentifiers": 5, 
  "distinctXMLIdentifiers": 0, 
  "distinctextensioncount": 59, 
  "duplicateHASHlisting": [
  "duplicatespathlist": [
  "eof_distance": [
  "errorlist": [
  "extensionIDOnlyCount": 3, 
  "extensionOnlyIDFrequency": [
  "extensionOnlyIDList": [
  "extmismatchCount": 25, 
  "filecount": 324, 
  "filename": "opf-test-corpus-test-output/opf-test-corpus-sf-analysis", 
  "filename_identifiers": [
  "filenameidentifiers": [
  "filenameidfilecount": 2, 
  "filesincontainercount": 0, 
  "frequencyOfAllExtensions": [
  "hashused": true, 
  "identificationgaps": 53, 
  "identifiedPercentage": "83.6", 
  "identifiedfilecount": 271, 
  "idmethodFrequency": [
  "mimetypeFrequency": [
  "multipleidentificationcount": 0, 
  "namespacecount": 3, 
  "namespacedata": null, 
  "nsdatalist": [
  "rogue_all_dirs": null, 
  "rogue_all_paths": null, 
  "rogue_denylist": [], 
  "rogue_dir_name_paths": [], 
  "rogue_duplicates": [
  "rogue_extension_mismatches": [], 
  "rogue_file_name_paths": [], 
  "rogue_identified_all": [
  "rogue_identified_pronom": [], 
  "rogue_multiple_identification_list": [], 
  "rogue_pronom_ns_id": null, 
  "signatureidentifiedfrequency": [
  "signatureidentifiers": [
  "text_identifiers": [
  "textidentifiers": [
  "textidfilecount": 8, 
  "tooltype": "siegfried: 1.5.0", 
  "totalHASHduplicates": 32, 
  "unidentifiedPercentage": "16.4", 
  "unidentifiedfilecount": 53, 
  "uniqueDirectoryNames": 50, 
  "uniqueExtensionsInCollectionList": [
  "uniqueFileNames": 315, 
  "version": 0, 
  "xml_identifiers": [
  "xmlidentifiers": null, 
  "xmlidfilecount": 0, 
  "zerobytecount": 28, 
  "zerobytelist": [
  "zeroidcount": 40
@ross-spencer ross-spencer added the enhancement a feature by any other name label Jan 18, 2022