Question regarding loading UEA public datasets #92
Comments
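Hi,
Thanks for your kind words.
Regarding your question, I tried with SelfRegulationSCP1 and the labels are indeed strings ('negativity' and 'positivity') and not -1 and +1. The labels are taken directly from the files, so I'm not sure that the best solution is to change the labels in this function, as it could be confusing for other users familiar with the datasets as they are. Maybe a better solution would be to change the labels directly in the original files. I think that you can raise an issue on this repository to do so: https://github.com/uea-machine-learning/tsml_repo
X_data[i] is a numpy.void object with 2 elements, so X_data[i][1] and X_data[i][-1] are equivalent. The ARFF structure is not really intuitive (I didn't know of its existence before working on this project), so it's not as easy as a CSV file with all but the last column for the input data and the last column for the target data.
I don't think that there is an issue with the current code, but I may be wrong. Could you give me more details about what gave you doubts (which dataset, etc.)?
Best,
Johann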
Hi Johann,
I really appreciate your quick reply. The dataset I was working on is
Wafer, which is a binary-class dataset. However, when I use the provided
function, fetch_uea_dataset(), to load it, data['target_train'] contains
more than two unique values. This finding led me to trace how the function
loads the data and splits it into features and labels. Please feel free to
let me know if I made any mistakes in using the library. Thanks a lot. :)
Best,
Michael
On Thu, Feb 18, 2021 at 4:18 AM Johann Faouzi ***@***.***> wrote:
Hi,
Thanks for your kind words.
Regarding your question, I tried with SelfRegulationSCP1 and the labels
are indeed strings ('negativity' and 'positivity') and not -1 and +1. The
labels are taken directly from the files, so I'm not sure that the best
solution is to change the labels in this function, as it could be confusing
for other users familiar with the datasets *as they are*. Maybe a better
solution would be to change the labels directly in the original files. I
think that you can raise an issue on this repository to do so:
https://github.com/uea-machine-learning/tsml_repo
X_data[i] is a numpy.void object with 2 elements, so X_data[i][1] and
X_data[i][-1] are equivalent. The ARFF structure is not really intuitive
(I didn't know of its existence before working on this project), so it's
not as easy as a CSV file with all but the last column for the input data
and the last column for the target data.
I don't think that there is an issue with the current code, but I may be
wrong. Could you give me more details about what gave you doubts
(which dataset, etc.)?
Best,
Johann
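The equivalence of X_data[i][1] and X_data[i][-1] described above can be checked with a small NumPy sketch. The structured array below is only a hypothetical stand-in for what an ARFF loader might return (one field holding the series, one holding the label); the field names and shapes are assumptions, not the actual pyts internals.

```python
import numpy as np

# Hypothetical stand-in for ARFF records: a block of series values plus a label.
dtype = [("series", "f8", (2, 3)), ("target", "U16")]
X_data = np.zeros(2, dtype=dtype)
X_data["target"] = ["negativity", "positivity"]

# Indexing a structured array with an integer yields a numpy.void record.
record = X_data[0]
n_elements = len(record)

# With exactly two elements, index 1 and index -1 point at the same field.
label_by_position = record[1]
label_from_end = record[-1]
```

With only two elements per record, both indices select the label, so changing the index alone would not alter the loaded targets.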
By the way, Wafer is the second dataset I tried before tracing the fetch_uea_dataset() function. The first was SharePriceIncrease, which should also be a binary-class dataset, but its loaded labels also contain more than two unique values. Thanks,
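A quick way to spot this kind of problem is to count the unique labels right after loading. The helper below is a hypothetical sketch (check_n_classes is not part of pyts), and the arrays are made-up examples rather than real Wafer or SharePriceIncrease targets.

```python
import numpy as np


def check_n_classes(y, expected=2):
    """Return the unique labels and whether their count matches `expected`."""
    classes = np.unique(y)
    return classes, len(classes) == expected


# Made-up targets: one looks binary, the other looks like raw series values.
y_binary = np.array(["-1", "1", "1", "-1"])
y_suspicious = np.array([0.12, -1.0, 0.57, 1.0])

classes_ok, is_binary_ok = check_n_classes(y_binary)
classes_bad, is_binary_bad = check_n_classes(y_suspicious)
```

If the target array of a dataset documented as binary fails this check, the labels were probably read from the wrong position in the file.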
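I think that I understood the issue.
UCR usually refers to the univariate time series classification archive, while UEA refers to the multivariate time series classification archive. Both datasets (Wafer and SharePriceIncrease) are univariate time series classification datasets, and should thus be loaded using pyts.datasets.fetch_ucr_dataset. fetch_uea_dataset will give unexpected results in this case. I should probably add a check in this function to make sure that the dataset is multivariate, so that it raises an error when trying to load a univariate dataset.
That being said, you should get an error when using pyts.datasets.fetch_ucr_dataset or pyts.datasets.fetch_uea_dataset if you don't provide the folder (data_home parameter), because these datasets are not listed in the available datasets (I haven't updated the list of available datasets for a while, I should definitely do it).
With pyts.datasets.fetch_ucr_dataset I can load a local version of Wafer. I can't with SharePriceIncrease because of this line (Line 283 in 1aa4558): https://github.com/johannfaouzi/pyts/blob/1aa45589b91a12e8d55db86f1f97dca0b6e99984/pyts/datasets/ucr.py#L283
because an OSError is raised and not an IndexError. When replacing this line with except (IndexError, OSError):, it works as intended. I think that I only used an IndexError because all the previous univariate datasets always had TXT files, but it seems not to be the case for more recent ones.
I hope this helps you a bit. I have some work to do to update these functions and I'm a bit busy right now, so I would suggest that you apply these fixes yourself in your local version of pyts, but I will try to add them to the repository as soon as possible.
Best,
Johann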
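The effect of catching both exception types can be illustrated with a minimal sketch. The loader below is hypothetical and is not the code in ucr.py: the function name, the "_TRAIN.txt" naming, and the label-in-first-column layout are all assumptions made for the example.

```python
import os
import tempfile

import numpy as np


def load_train_split(folder, name):
    """Hypothetical loader: prefer a TXT file, fall back if it is unusable."""
    txt_path = os.path.join(folder, name + "_TRAIN.txt")
    try:
        data = np.loadtxt(txt_path, ndmin=2)
        # Assumed layout: first column is the label, the rest is the series.
        return "txt", data[:, 1:], data[:, 0]
    except (IndexError, OSError):
        # A missing file raises FileNotFoundError, a subclass of OSError,
        # so catching IndexError alone would let the error propagate.
        return "fallback", None, None


with tempfile.TemporaryDirectory() as folder:
    # No TXT file exists yet, so the fallback branch is taken.
    source_missing, _, _ = load_train_split(folder, "Demo")

    # After writing a small TXT file, the primary branch is taken.
    rows = [[1, 0.1, 0.2], [-1, 0.3, 0.4]]
    np.savetxt(os.path.join(folder, "Demo_TRAIN.txt"), rows)
    source_found, X, y = load_train_split(folder, "Demo")
```

In a real loader the fallback branch would read another file format instead of returning placeholders; the point here is only that the tuple in the except clause covers the missing-file case.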
Hi Johann,
Yes, you are absolutely right. I think there are two reasons: (1) I was
working on univariate time series, and (2) I had commented out the line
that checks the dataset list, since my target datasets are not in that list.
By the way, I have solved the problem by customizing
fetch_uea_dataset(). Thank you so much! :)
Best Regards,
Michael
On Fri, Feb 19, 2021 at 3:30 AM Johann Faouzi ***@***.***> wrote:
I think that I understood the issue.
UCR usually refers to the *univariate* time series classification
archive, while UEA refers to the *multivariate* time series
classification archive. Both datasets (Wafer and SharePriceIncrease) are
*univariate* time series classification datasets, and should thus be
loaded using pyts.datasets.fetch_ucr_dataset
<https://pyts.readthedocs.io/en/stable/generated/pyts.datasets.fetch_ucr_dataset.html#pyts.datasets.fetch_ucr_dataset>.
fetch_uea_dataset will give unexpected results in this case. I should
probably add a check in this function to make sure that the dataset is
multivariate, so that it raises an error when trying to load a univariate
dataset.
That being said, you should get an error when using
pyts.datasets.fetch_ucr_dataset or pyts.datasets.fetch_uea_dataset if you
don't provide the folder (data_home parameter) because these datasets are
not listed in the available datasets (I haven't updated the list of
available datasets for a while, I should definitely do it).
With pyts.datasets.fetch_ucr_dataset I can load a local version of Wafer.
I can't with SharePriceIncrease because of this line:
https://github.com/johannfaouzi/pyts/blob/1aa45589b91a12e8d55db86f1f97dca0b6e99984/pyts/datasets/ucr.py#L283
because an OSError is raised and not an IndexError. When replacing this
line with except (IndexError, OSError):, it works as intended.
I hope this helps you a bit. I have some work to do to update these
functions and I'm a bit busy right now, so I would suggest that you apply
these fixes yourself in your local version of pyts, but I will try to add
them to the repository as soon as possible.
Best,
Johann
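The safeguard suggested in this message (raising an error when a univariate dataset is passed to the multivariate loader) could look roughly like the sketch below. It is only an illustration under stated assumptions: check_is_multivariate is a hypothetical helper, and it assumes the data has already been loaded into an array of shape (n_samples, n_features, n_timestamps).

```python
import numpy as np


def check_is_multivariate(X):
    """Hypothetical guard: require shape (n_samples, n_features, n_timestamps)."""
    X = np.asarray(X)
    if X.ndim != 3:
        raise ValueError(
            "Expected a 3D array for a multivariate dataset, "
            f"got {X.ndim} dimension(s); a univariate dataset should be "
            "loaded with fetch_ucr_dataset instead."
        )
    return X


X_multi = np.zeros((4, 6, 10))  # 4 samples, 6 features, 10 timestamps
X_uni = np.zeros((4, 10))       # 4 samples, 10 timestamps

checked = check_is_multivariate(X_multi)
try:
    check_is_multivariate(X_uni)
    raised = False
except ValueError:
    raised = True
```

Failing loudly in this way would replace the silent mislabeling seen with Wafer and SharePriceIncrease by an explicit message pointing to the right loader.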
Hi, first of all, I really appreciate this wonderful library for working with time series. I am using pyts to load UEA datasets, but I found that when I load a binary-class dataset, the loaded labels are not binary. After debugging, I suspect there may be an issue with the line referenced below.
pyts/pyts/datasets/uea.py
Line 297 in 1aa4558
I think the last element, X_data[i][-1], should be the label, either -1 or 1, instead of X_data[i][1]. Moreover, X should drop the last column, which holds the labels.
I am not sure whether my interpretation is correct. I look forward to hearing from you, and I hope this powerful tool keeps getting better. Thanks so much. :)