hxlquickimport #6

Closed
fititnt opened this issue Jan 29, 2021 · 4 comments
Labels
proof-of-concept-already-exist Do exist proof of concept (or better) for this issue

Comments

@fititnt
Member

fititnt commented Jan 29, 2021

Meta

hxl +public  
meta +status working-draft
meta +id EticaAI-Data_HXL-Data-Science-file-formats_hxlquickimport
meta +discussion+public #6
meta +hxlproxy +url https://proxy.hxlstandard.org/data?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY%2Fedit%23gid%3D1097528220
meta +description hxlquickimport is a quick (and wrong) way to import a non-HXL dataset (like a .csv or .xlsx, but it requires headers already on the first row) without human intervention. It will try to slugify each original header and add it as an +attribute to a base hashtag like #meta. The result may be an HXL file with valid syntax (that can be used for automated testing), but most HXL-powered tools would still need human review.

How does it work?

"[Max Power] Kids: there's three ways to do things; the right way, the wrong way and the Max Power way!
[Bart Simpson] Isn't that the wrong way?
[Max Power] Yeah, but faster!"
(via https://www.youtube.com/watch?v=7P0JM3h7IQk)

How to do it the right way?

Read the documentation on https://hxlstandard.org/. (Tip: both HXL Postcards and the hxl-hashtag-chooser are very helpful!)
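
The slugify-and-tag idea from the description can be sketched in a few lines of standard-library Python. This is only an illustration, not the code in bin/hxlquickimport: the function names `slugify`, `quick_hashtags` and `quick_import` are hypothetical, and the default base hashtag `#item` is an assumption taken from the output shown later in this thread.

```python
import csv
import io
import re
import unicodedata


def slugify(header: str) -> str:
    """Reduce a header to lowercase ASCII letters, digits and underscores."""
    ascii_text = (
        unicodedata.normalize("NFKD", header)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")


def quick_hashtags(headers, base="#item"):
    """Build one hashtag per column: the base hashtag plus the slug as attribute.

    Note: a slug that starts with a digit is not handled here, although
    HXL attributes are expected to start with a letter.
    """
    tags = []
    for header in headers:
        slug = slugify(header)
        tags.append(f"{base}+{slug}" if slug else base)
    return tags


def quick_import(csv_text: str, base: str = "#item") -> str:
    """Insert a generated hashtag row right after the first (header) row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(rows[0])                        # original text headers
    writer.writerow(quick_hashtags(rows[0], base))  # generated hashtag row
    writer.writerows(rows[1:])                      # data rows, unchanged
    return out.getvalue()
```

As the description says, the result is only syntactically valid HXL: every column gets the same base hashtag, so a human still has to choose meaningful hashtags afterwards.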

Spreadsheet data

See EticaAI-Data_HXL-Data-Science-file-formats_hxlquickimport (https://docs.google.com/spreadsheets/d/1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY/edit#gid=1097528220) for updated content. This is a snapshot.

| Category | Nome | URL | URL source |
| --- | --- | --- | --- |
| #item+category | #item +name | #item +url | #item +source +url |
| test-dataset | mx.gob.dados_dataset_informacion-referente-a-casos-covid-19-en-mexico_2020-06-01.csv | https://drive.google.com/file/d/1nQAu6lHvdh2AV7q6aewGBQIxFz7VrCF9/view?usp=sharing | https://github.com/CMedelR/dataCovid19 |
| test-dataset | br.einstein_dataset_covid-pacientes-hospital-albert-einstein-anonimizado_2020-03-28_before-HXLate | https://docs.google.com/spreadsheets/d/1GQVrCQGEetx7RmKaZJ8eD5dgsr5i1zNy_UJpX3_AgrE/edit?usp=sharing | https://www.kaggle.com/einsteindata4u/covid19 |
| research-paper | data-mining-for-the-study-of-the-epidemic-sars-cov-2-covid-19-algorithm-for-the-identification-of-patients-sars-cov-2-covid-19-in-mexico.pdf | https://drive.google.com/file/d/1WaW2b7bGiSZjvc4OdA0kjrBtRTkKV11N/view?usp=sharing | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3619549 |

@fititnt fititnt changed the title hxl-quick-import hxlquickimport Jan 29, 2021
@fititnt
Member Author

fititnt commented Jan 29, 2021

Thanks to @CMedelR!!!

Not only does Ramírez have a research paper called Data mining for the study of the Epidemic (SARS-CoV-2) COVID-19: Algorithm for the identification of patients (SARS-CoV-2) COVID 19 in Mexico, and not only does his repository at https://github.com/CMedelR/dataCovid19 have a backup copy of the (at the moment) offline dataset at https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico, but his paper also explicitly mentions the use of Orange Data Mining!

While his dataset will be used as an additional test sample (until now the only one was from the Albert Einstein Hospital in São Paulo), we're also adding his paper, since I'm sure more people will want to find it later!

fititnt added a commit that referenced this issue Jan 29, 2021
… yet); hic sunt dracones

                    (__)    )
                    (..)   /|\\
                    (o_o)  / | \\
                    ___) \/,-|,-\|
                //,-/_\ )  '  '
                    (//,-'\|
                    (  ( . \_
                gnv `._\(___`.
                        '---' _)/
                            `-'
@fititnt fititnt added the proof-of-concept-already-exist Do exist proof of concept (or better) for this issue label Feb 6, 2021
@fititnt
Member Author

fititnt commented Feb 21, 2021

The hxlquickmeta (cli tool) + HXLMeta (Usable Class) #9, while able to fall back to Pandas and then to Orange Data Mining, still fails with something like hxlquickmeta tests/files/iris.csv.

I think that at least for very basic CSV files, the hxlquickmeta could implement the features of hxlquickimport.

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hxlquickmeta tests/files/iris.csv
> Connection overview
 >> TODO: implement raw connection, HTTP headers, etc
 >>       (this should output debug information even
 >>       for inputs that would break libhxl)
ERROR! libhxl and/or HXLmeta/HXLMetaExtras failed <HXLException: HXL tags not found in first 25 rows>
Ok. Trying harder now with HXLMetaExtras...
 
 >> HXLMetaExtras: Pandas DataFrame 
   >>> DataFrame
     sepallength  sepalwidth  petallength  petalwidth           class
0            5.1         3.5          1.4         0.2     Iris-setosa
1            4.9         3.0          1.4         0.2     Iris-setosa
2            4.7         3.2          1.3         0.2     Iris-setosa
3            4.6         3.1          1.5         0.2     Iris-setosa
4            5.0         3.6          1.4         0.2     Iris-setosa
..           ...         ...          ...         ...             ...
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

[150 rows x 5 columns]
   >>> DataFrame.T
                     0            1            2            3            4            5    ...             144             145             146             147             148             149
sepallength          5.1          4.9          4.7          4.6          5.0          5.4  ...             6.7             6.7             6.3             6.5             6.2             5.9
sepalwidth           3.5          3.0          3.2          3.1          3.6          3.9  ...             3.3             3.0             2.5             3.0             3.4             3.0
petallength          1.4          1.4          1.3          1.5          1.4          1.7  ...             5.7             5.2             5.0             5.2             5.4             5.1
petalwidth           0.2          0.2          0.2          0.2          0.2          0.4  ...             2.5             2.3             1.9             2.0             2.3             1.8
class        Iris-setosa  Iris-setosa  Iris-setosa  Iris-setosa  Iris-setosa  Iris-setosa  ...  Iris-virginica  Iris-virginica  Iris-virginica  Iris-virginica  Iris-virginica  Iris-virginica

[5 rows x 150 columns]
   >>> DataFrame.describe
       sepallength  sepalwidth  petallength  petalwidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000
 
 >> HXLMetaExtras: Orange Data Mining
data.domain [sepallength, sepalwidth, petallength, petalwidth, class]
data.columns <Orange.data.table.Columns object at 0x7f416848cd30>

@fititnt
Member Author

fititnt commented Feb 21, 2021

> I think that at least for very basic CSV files, the hxlquickmeta could implement the features of hxlquickimport.

My last comment can be ignored. Actually, this may not be needed. As long as hxlquickmeta accepts stdin (can be piped into) and all the other tools work with pipes (the standard ones from HXLStandard do!), there is no need to implement this at all.

So instead of hxlquickmeta tests/files/iris.csv, the command is just hxlquickimport tests/files/iris.csv | hxlquickmeta

This makes hxlquickmeta fail:

# Non HXLated file
hxlquickmeta tests/files/iris.csv
(...)
ERROR! libhxl and/or HXLmeta/HXLMetaExtras failed <HXLException: HXL tags not found in first 25 rows>
Ok. Trying harder now with HXLMetaExtras...
(...)

This one works (but not for complex Excel files):

# Non HXLated file
hxlquickimport tests/files/iris.csv | hxlquickmeta
## (...)
> lihxl-python overview
 >> output.output <_io.TextIOWrapper name='/tmp/tmphdplthem' mode='w' encoding='UTF-8'>
 >> source <hxl.io.HXLReader object at 0x7fc33c008820>
 
> HXLMeta debuginfo
 >> HXLMeta.text_headers None
 >> HXLMeta.hxl_headers ['#item+sepallength', '#item+sepalwidth', '#item+petallength', '#item+petalwidth', '#item+class']
> get_hashtag_info [ #item+sepallength ] [ None ]
(...)
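
The difference between the two runs above is whether a hashtag row can be found near the top of the input, which is what the "HXL tags not found in first 25 rows" error refers to. The check can be sketched as follows. This is a simplified illustration, not libhxl-python's actual detection logic: the regular expression, the rule that every non-empty cell must be a hashtag, and the names `looks_like_hashtag_row` and `find_hashtag_row` are all assumptions made for this example.

```python
import csv
import io
import re

# A hashtag such as "#item+sepallength" or "#item +source +url"
# (simplified pattern, assumed for this sketch).
HASHTAG = re.compile(
    r"^\s*#[A-Za-z][A-Za-z0-9_]*(\s*\+[A-Za-z][A-Za-z0-9_]*)*\s*$"
)


def looks_like_hashtag_row(row):
    """True if the row has at least one non-empty cell and all of them are hashtags."""
    cells = [cell for cell in row if cell.strip()]
    return bool(cells) and all(HASHTAG.match(cell) for cell in cells)


def find_hashtag_row(csv_text, max_rows=25):
    """Return the index of the first hashtag row within max_rows, or None."""
    reader = csv.reader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        if looks_like_hashtag_row(row):
            return index
    return None


# Plain CSV, as in tests/files/iris.csv: no hashtag row is found.
raw = "sepallength,class\n5.1,Iris-setosa\n"
# Same data after a hashtag row was added right below the headers.
tagged = "sepallength,class\n#item+sepallength,#item+class\n5.1,Iris-setosa\n"

print(find_hashtag_row(raw))
print(find_hashtag_row(tagged))
```

Under these assumptions the plain file yields no hashtag row, while the piped version has one on the second line, which is why the pipe is enough and hxlquickmeta does not need its own import logic.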

Potential problem with hxlquickmeta if it does not work with streams

I will add this comment to another issue, so the notes are kept for the future.

@fititnt
Member Author

fititnt commented Mar 28, 2021

hxlquickimport already has a working proof of concept and, since it is an all-in-one single file, it can work even without [meta issue] hxlm #11 or the [meta] hxlm.core. As long as the libraries it depends on are installed, you just need to put bin/hxlquickimport on your executable path.

If needed, this issue could be re-opened, but the current version of bin/hxlquickimport (a single file) is mostly an hxltag with implicit defaults, and either way it could be something I would propose adding to HXLStandard/libhxl-python.

Possible future task (but not today)

Without actually doing a full refactoring to use something like the hxlm.core (or something more 'pythonic'), maybe bin/hxlquickimport will be made available when installing this repository with

pip install git+https://github.com/EticaAI/HXL-Data-Science-file-formats

With this, it would at least be more intuitive to explain another strategy for using these tools (and then Minimal documentation about how to use the command line tools #1 could be resolved).

@fititnt fititnt closed this as completed Mar 28, 2021
fititnt added a commit that referenced this issue Apr 20, 2021
fititnt added a commit that referenced this issue Apr 20, 2021
…, hxlquickmeta v1.2.0 (#9), hxl2tab v2.1 (#2); resolves mvp-documentation (fixes #1)