hxlquickmeta (cli tool) + HXLMeta (Usable Class) #9
Comments
Oh boy, naming is complicated.
HXLMeta_Glossary
The EticaAI-Data_HXL-Data-Science-file-formats has an HXLMeta_Glossary to describe concepts (actually, some sort of "ID" to reference from other documents). This is still not the final result, but there are so many concepts across so many programs that, to make some sense of everything, I'm trying to put some names on them. Eventually we could come up with nicer ways to mention them.
References
Also, the table EticaAI-Data_HXL-Data-Science-file-formats_References has some documents I'm just leaving there. Some of these may be used later as references for the actual development.
TODO: software comparison
Maybe at some point it will be necessary to put some of the software we're testing in one place, but instead of having one spreadsheet for every software, have some organized by how the terms relate to each other. I mean: how Another point is that some open source tools are actually more powerful than others and can open several proprietary formats, like Jamovi (https://www.jamovi.org/), so this table in particular could be used by users beyond us to check alternatives. See:
…aAI-Data_HXL-Data-Science-file-formats_HXLMeta_Glossary
The EticaAI-Data_HXL-Data-Science-file-formats_HXLMeta_StatisticalType table is still a working draft, but both Statistical data type (https://en.wikipedia.org/wiki/Statistical_data_type) and Level of measurement (https://en.wikipedia.org/wiki/Level_of_measurement) may actually be worth using as an internal taxonomy, since they are the closest thing to a way of translating variable types between different software. In other words: since most programs already use math (though some use their own terms for a group of mathematical types), we could internally use the taxonomy closest to the math and then, per program, translate how each one uses it. This is likely to make the initial draft harder (like, very hard), but I think it may make things much easier later.
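A minimal sketch of the "internal taxonomy, then translate per program" idea, using the four classic levels of measurement as the internal vocabulary. The `LevelOfMeasurement` enum, the `TRANSLATIONS` table and the `translate()` helper are hypothetical names for illustration, and the per-program mappings (in particular treating ordinal as a plain categorical type) are assumptions, not something taken from the StatisticalType table.

```python
from enum import Enum


class LevelOfMeasurement(Enum):
    """Internal taxonomy (hypothetical), close to the math terms."""
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Per-program vocabulary. These tables are illustrative assumptions;
# the real mapping would come from the glossary spreadsheets.
TRANSLATIONS = {
    "orange": {
        LevelOfMeasurement.NOMINAL: "discrete",
        LevelOfMeasurement.ORDINAL: "discrete",
        LevelOfMeasurement.INTERVAL: "continuous",
        LevelOfMeasurement.RATIO: "continuous",
    },
    "weka": {
        LevelOfMeasurement.NOMINAL: "nominal",
        LevelOfMeasurement.ORDINAL: "nominal",
        LevelOfMeasurement.INTERVAL: "numeric",
        LevelOfMeasurement.RATIO: "numeric",
    },
}


def translate(level, program):
    """Translate an internal level into the term one program uses."""
    try:
        return TRANSLATIONS[program][level]
    except KeyError:
        raise ValueError(f"no translation for {level!r} in {program!r}") from None


print(translate(LevelOfMeasurement.RATIO, "orange"))
print(translate(LevelOfMeasurement.ORDINAL, "weka"))
```

Keeping one internal vocabulary means each new program only needs one extra table, instead of a pairwise mapping between every two programs.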
…m1-#adm5; GLOSSARY have values for DataType, StorageType and StatisticalType
…eType; Added HXL_REFERENCE.hashtag.x_example
… +id, +label, +name, +url, +email, +phone, +date
…ashtag="#adm2+code" --hxlquickmeta-value="BR3106200"_
…atypes (for debug & compare with other tools)
…orks both for inline debug and inspection of local or remote file
…ype(), get_statistical_type(), get_storage_type(), get_usage_type(), get_weight_level()
With the last commit, the At this point, it still treats HXLated files as average CSV, so pandas somewhat brute-forces what the columns mean. This is not ideal, but hxlquickmeta can already be used as a proxy to quickly analyse any remote dataset (I still need to test a bit more when the CSV is not yet HXLated, but libhxl-python already tolerates CSV files if we use the logic of the
fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hxlquickmeta https://data.humdata.org/dataset/2d058968-9d7e-49a9-b28f-2895d7f6536f/resource/a12bad12-f5ea-493c-9faa-66cb3f3e9ca7/download/fts_incoming_funding_bra.csv
> Connection overview
>> TODO: implement raw connection, HTTP headers, etc
>> (this should output debug information even
>> for inputs that would break libhxl)
> lihxl-python overview
>> output.output <_io.TextIOWrapper name='/tmp/tmpkwejyuli' mode='w' encoding='UTF-8'>
>> source <hxl.io.HXLReader object at 0x7f17b3d0fa00>
> HXLMeta debuginfo
>> HXLMeta.text_headers ['date', 'budgetYear', 'description', 'amountUSD', 'srcOrganization', 'srcOrganizationTypes', 'srcLocations', 'srcUsageYearStart', 'srcUsageYearEnd', 'destPlan', 'destPlanCode', 'destPlanId', 'destOrganization', 'destOrganizationTypes', 'destGlobalClusters', 'destLocations', 'destProject', 'destProjectCode', 'destEmergency', 'destUsageYearStart', 'destUsageYearEnd', 'contributionType', 'flowType', 'method', 'boundary', 'onBoundary', 'status', 'firstReportedDate', 'decisionDate', 'keywords', 'originalAmount', 'originalCurrency', 'exchangeRate', 'id', 'refCode', 'createdAt', 'updatedAt']
>> HXLMeta.hxl_headers ['#date', '#date+year+budget', '#description+notes', '#value+funding+total+usd', '#org+name+funder', '#org+type+funder+list', '#country+iso3+funder+list', '#date+year+start+funder', '#date+year+end+funder', '#activity+appeal+name', '#activity+appeal+id+external', '#activity+appeal+id+fts_internal', '#org+name+impl', '#org+type+impl+list', '#sector+cluster+name+list', '#country+iso3+impl+list', '#activity+project+name', '#activity+project+code', '#crisis+name', '#date+year+start+impl', '#date+year+end+impl', '#financial+contribution+type', '#financial+contribution+type', '#financial+method', '#financial+direction', '#financial+direction+type', '#status+text', '#date+reported', '#date+decision', '#description+keywords', '#value+funding+total', '#value+funding+total+currency', '#financial+fx', '#activity+id+fts_internal', '#activity+code', '#date+created', '#date+updated']
### (LONG LIST OMITTED) ####
>> HXLMetaExtras: Pandas DataFrame
>>> DataFrame
#date #date+year+budget #description+notes #value+funding+total+usd ... #activity+id+fts_internal #activity+code #date+created #date+updated
0 2020-05-24 NaN Venezuela Migrants Outflows Multiyear 2019 to ... 0 ... 210646 NaN 2020-05-24 2020-07-23
1 2019-11-30 NaN Integral Protection and Humanitarian Assistanc... 0 ... 202767 7F-10139.02 2019-12-10 2020-10-20
2 2019-10-31 NaN Economic Integration of Venezuelan Migrants an... 225990 ... 215375 PC-2020-001 2020-07-23 2020-07-23
[3 rows x 37 columns]
>>> DataFrame.T
0 1 2
#date 2020-05-24 2019-11-30 2019-10-31
#date+year+budget NaN NaN NaN
#description+notes Venezuela Migrants Outflows Multiyear 2019 to ... Integral Protection and Humanitarian Assistanc... Economic Integration of Venezuelan Migrants an...
#value+funding+total+usd 0 0 225990
#org+name+funder European Commission EuropeAid Development and ... Switzerland, Government of United States of America, Government of
#org+type+funder+list Inter-governmental Government Government
#country+iso3+funder+list NaN CHE USA
#date+year+start+funder 2019 2019 2019
#date+year+end+funder 2019 2019 2019
#activity+appeal+name NaN NaN NaN
#activity+appeal+id+external NaN NaN NaN
#activity+appeal+id+fts_internal NaN NaN NaN
#org+name+impl International Organization for Migration Comitato Internationale per lo Sviluppo dei Po... International Labour Organization
#org+type+impl+list UN agency NGO UN agency
#sector+cluster+name+list NaN NaN Multi-sector
#country+iso3+impl+list ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME... ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME... ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME...
#activity+project+name NaN NaN NaN
#activity+project+code NaN NaN NaN
#crisis+name VENEZUELA Outflow - Regional Refugees and Migr... VENEZUELA Outflow - Regional Refugees and Migr... VENEZUELA Outflow - Regional Refugees and Migr...
#date+year+start+impl 2019 2019 2021
#date+year+end+impl 2021 2021 2021
#financial+contribution+type financial financial financial
#financial+contribution+type.1 Parked Parked Standard
#financial+method Traditional aid Traditional aid Traditional aid
#financial+direction incoming incoming incoming
#financial+direction+type shared shared shared
#status+text commitment commitment paid
#date+reported 2020-05-24 2019-12-05 2020-07-23
#date+decision 2019-05-09 NaN 2019-10-31
#description+keywords Multiyear Multiyear NaN
#value+funding+total NaN 0.0 NaN
#value+funding+total+currency NaN CHF NaN
#financial+fx NaN 0.992 NaN
#activity+id+fts_internal 210646 202767 215375
#activity+code NaN 7F-10139.02 PC-2020-001
#date+created 2020-05-24 2019-12-10 2020-07-23
#date+updated 2020-07-23 2020-10-20 2020-07-23
>>> DataFrame.describe
#date+year+budget #value+funding+total+usd #date+year+start+funder #date+year+end+funder ... #date+year+end+impl #value+funding+total #financial+fx #activity+id+fts_internal
count 0.0 3.000000 3.0 3.0 ... 3.0 1.0 1.000 3.000000
mean NaN 75330.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 209596.000000
std NaN 130475.387334 0.0 0.0 ... 0.0 NaN NaN 6369.245717
min NaN 0.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 202767.000000
25% NaN 0.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 206706.500000
50% NaN 0.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 210646.000000
75% NaN 112995.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 213010.500000
max NaN 225990.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 215375.000000
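For readers who want to reproduce the two header lists shown in the debug output (HXLMeta.text_headers and HXLMeta.hxl_headers) without any dependency, here is a rough sketch using only the standard library. It is not how hxlquickmeta or libhxl-python do it: `split_headers()` is a hypothetical helper, and the rule "every non-empty cell starts with #" is a simplifying assumption about what counts as a hashtag row. The sample reuses three columns from the dataset above.

```python
import csv
import io

# Three columns taken from the output above: text headers, HXL hashtags, one data row.
SAMPLE = """date,amountUSD,srcOrganization
#date,#value+funding+total+usd,#org+name+funder
2019-10-31,225990,"United States of America, Government of"
"""


def split_headers(csv_text, max_scan=25):
    """Return (text_headers, hxl_headers, data_rows) for an HXLated CSV.

    Simplifying assumption: the hashtag row is the first row, within the
    first `max_scan` rows, whose non-empty cells all start with '#'.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    for index, row in enumerate(rows[:max_scan]):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells and all(cell.startswith("#") for cell in cells):
            # The row right before the hashtags is taken as the text headers.
            text_headers = rows[index - 1] if index > 0 else []
            return text_headers, row, rows[index + 1:]
    raise ValueError("no HXL hashtag row found")


text_headers, hxl_headers, data_rows = split_headers(SAMPLE)
print(text_headers)
print(hxl_headers)
print(len(data_rows))
```

A plain CSV that is not yet HXLated raises `ValueError` here, which is one place where a fallback (or a warning) could be plugged in.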
…re an well HXLated CSV fully saved on local disk (cannot work with data stream)
Note: comment #6 (comment) also mentions the difference between working and not working with streaming.
…icated folder (to have specific classes for each type)
…coments NOT extend numpy data types if need a lot of non numeric labels; but maybe xarray could come later (or only be used as external tool, while keeping the internal abstractions to data types cleaner)
…aHtype, urlDataHtype, emailDataHtype, phoneDataHtype, dateDataHtype
…oth need to be added on spreadsheets)
…potential values to constants (DII, PII, etc)
…ry to installable namespaced (with -eticaai) pip package; still need to figure out this python pip thing
…ed packages (for pluggable extensions)
…nd namespaced packages for local developemnt (aka when not installed as package, but using on VSCode)
The same comments made in #6 (comment) apply to
It's a bit sad that I did not take the time to explain what hxlquickmeta was, but anyway, we're already going even deeper with HDP #16.
One feature of the HXLTabConverter common class #8 actually requires knowing the supposed data types of already HXLated datasets (we're already reading all the documentation to see how to make inferences without forcing users to use type hints everywhere). So, let's break this out into a separate class [and, as much as possible, already try to use data structures that could be converted from JSON or something similar] to create something that can actually make these inferences.
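A small sketch of what "data structures that could be converted from JSON" might look like for this kind of inference. Everything here is hypothetical: `RULES_JSON`, its two sections and `expected_type()` are illustration names, and the rules themselves are assumptions written for the example, not copied from the HXL Standard.

```python
import json

# Illustrative rules only (assumption): NOT copied from the HXL Standard.
RULES_JSON = """
{
  "hashtags": {
    "#date": "date",
    "#value": "number",
    "#affected": "number",
    "#population": "number"
  },
  "attributes": {
    "+name": "text",
    "+code": "text",
    "+url": "url",
    "+email": "email",
    "+phone": "phone"
  }
}
"""


def expected_type(hxl_header, rules):
    """Guess the expected data type of a column from its hashtag + attributes.

    Attributes are checked first, so '#org+name' is text even if '#org'
    had a rule of its own. Returns None when no rule applies.
    """
    parts = hxl_header.strip().lower().split("+")
    hashtag = parts[0]
    for attribute in ("+" + name for name in parts[1:]):
        if attribute in rules["attributes"]:
            return rules["attributes"][attribute]
    return rules["hashtags"].get(hashtag)


rules = json.loads(RULES_JSON)
for header in ("#date+reported", "#value+funding+total+usd",
               "#org+name+funder", "#status+text"):
    print(header, "->", expected_type(header, rules))
```

Because the rules are plain JSON, they could later be generated from a spreadsheet export instead of being hard-coded in the class.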
The more specific HXL Core hashtags
One advantage of using hashtags that are already defined in the specification itself is that the specification, in several cases, enforces the types. This happens especially for indicators. So it is actually possible (at least when not doing something like brute forcing with hxlquickimport) to be somewhat sure about what to expect from the data columns.
Which accuracy to aim for?
In my personal, honest opinion, >90% of the cases is good enough, including making inferences beyond the official documentation (but at this point we may need to check at least a good number of rows to deduce them). But there should be a way for users to explicitly enforce a type (even if it means a more verbose attribute).
Maybe a different approach, tolerating even less accuracy on the first try (think >75%, maybe less), applies if it is possible to easily import back the exported format (think of the .tab from Orange Data Mining, but it could be Weka and others): we assume that the data types and data flags (Is it meta? Can it be ignored? Etc.) could then be imported back with more data type hints, which would not change if exported again.
In other words: very long spreadsheets would arrive already somewhat optimized, to be corrected in an external program. (I think this is much more likely to happen for data flags than for data types; in fact, we may need to create some way to allow more than one target variable.)
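To make the accuracy discussion concrete, here is a rough sketch of threshold-based inference with an explicit override. `infer_type()`, its `enforced` parameter and the two checks are hypothetical names, and only "number" and ISO "date" are tried, which is a deliberate simplification.

```python
from datetime import date


def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value):
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def infer_type(values, threshold=0.90, enforced=None):
    """Return (type_name, share of sampled values supporting it).

    `enforced` lets the user skip inference entirely. Empty cells are
    ignored. If no candidate reaches `threshold`, fall back to 'text'
    and report the share of values that matched neither check.
    """
    if enforced is not None:
        return enforced, 1.0
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return "text", 0.0
    for name, check in (("number", _is_number), ("date", _is_date)):
        share = sum(1 for v in sample if check(v)) / len(sample)
        if share >= threshold:
            return name, share
    neither = sum(1 for v in sample if not (_is_number(v) or _is_date(v)))
    return "text", neither / len(sample)


column = ["1"] * 8 + ["n/a", "unknown"]
print(infer_type(column))                  # strict first try
print(infer_type(column, threshold=0.75))  # more tolerant first try
print(infer_type(column, enforced="text")) # user knows better
```

The same column changes type depending on the threshold, which is why an explicit override (or hints imported back from the external tool) matters.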
How to warn about suggestions outside what is already strictly defined in the HXL Standard
Also, since the concept of debug logs already exists, I think that when we make inferences on tags with less than 90% confidence (or when we discover, from an analysis of 100 (or up to 10,000) rows, that the user literally did poor tagging and this is 98% likely to fail in external data mining tools), we should still warn the user. (This type of feature would be needed when trying to brute force with hxlquickimport, so at least some quick checks could already exist.)
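One possible shape for such a quick check, sketched with the standard `logging` module. The logger name, `check_tagging()` and its parameters are hypothetical; the 100-row sample and the 90% threshold are just the numbers mentioned above used as defaults.

```python
import logging

logger = logging.getLogger("hxlmeta.sketch")  # hypothetical logger name


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def check_tagging(hxl_header, values, expected_check,
                  max_rows=100, threshold=0.90):
    """Warn when sampled values disagree with what the hashtag suggests.

    Looks only at the first `max_rows` values, ignores empty cells and
    returns the warning message (or None when the column looks fine).
    """
    sample = [v for v in values[:max_rows] if v not in (None, "")]
    if not sample:
        return None
    share = sum(1 for v in sample if expected_check(v)) / len(sample)
    if share < threshold:
        message = (f"{hxl_header}: only {share:.0%} of {len(sample)} "
                   "sampled values match the expected type")
        logger.warning(message)
        return message
    return None


logging.basicConfig(level=logging.WARNING)
check_tagging("#value+funding+total+usd", ["0", "0", "225990"], is_number)
check_tagging("#value+funding+total+currency", ["CHF", "USD", "1"], is_number)
```

Only the second call logs a warning, since a column of currency codes does not look like the number its `#value` hashtag would suggest.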