Corrected norwegian bokmal stopwords and removed nynorsk words #293

Closed
wants to merge 3 commits into from
1 change: 1 addition & 0 deletions src/Microsoft.ML.Transforms/Microsoft.ML.Transforms.csproj
@@ -4,6 +4,7 @@
<TargetFramework>netstandard2.0</TargetFramework>
<IncludeInPackage>Microsoft.ML</IncludeInPackage>
<DefineConstants>CORECLR</DefineConstants>
+ <BaseIntermediateOutputPath>..\..\bin\obj\AnyCPU.Debug\Microsoft.ML.Transforms</BaseIntermediateOutputPath>
Contributor

@glebuk glebuk Jun 18, 2018

..\..\bin\obj\AnyCPU.Debug\Microsoft.ML.Transforms

There seems to be a DEBUG-only path in the general PropertyGroup section. The build type and platform should be replaced with the appropriate variables. As it stands, the path is incorrect for the release build.

</PropertyGroup>

<ItemGroup>
Binary file modified src/Microsoft.ML.Transforms/Text/StopWords/Norwegian_Bokmal.txt
Binary file not shown.
14 changes: 11 additions & 3 deletions src/Microsoft.ML.Transforms/Text/StopWordsRemoverTransform.cs
@@ -90,7 +90,10 @@ public enum Language
Polish = 12,
Czech = 13,
Arabic = 14,
- Japanese = 15
+ Japanese = 15,
+
+ [HideEnumValue]
+ Norwegian_Bokmal_v1 = 256
Contributor

@glebuk glebuk Jun 18, 2018

v1

Shouldn't this be v2, since the original should remain as "Norwegian_Bokmal"?

Contributor

Hi @glebuk. The code as it stands is correct here. Norwegian_Bokmal should always refer to whatever list is best, but we must retain the "old" files for backward compatibility, as mentioned. See my comment here, where I went into this in some detail.

Contributor

@glebuk glebuk Jun 22, 2018

So should we move the previous enum id to v1 and assign a new id to the new Norwegian_Bokmal enum label then? Otherwise how would the old model be compatible?


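A minimal sketch of how the load path in this diff appears to answer that question: the enum id stored in an old model is left untouched, and the version the model was written with decides which word list that id maps to. `LanguageCompat` and `Resolve` are hypothetical names introduced only for illustration, and the id `1` for `Norwegian_Bokmal` is a placeholder because the real value is not shown in this diff. Only `256` and the two version constants come from the change itself.

```csharp
using System;

// Stand-in for the transform's Language enum, reduced to the two
// values under discussion.
public enum Language
{
    Norwegian_Bokmal = 1,      // placeholder id (assumption): real value not shown in the diff
    Norwegian_Bokmal_v1 = 256  // id taken from the diff
}

// Hypothetical helper, not part of the PR.
public static class LanguageCompat
{
    // Version constants as they appear in the diff.
    public const uint VerInitial = 0x00010001;
    public const uint VerCorrectedBokmal = 0x00010002;

    // Maps the id stored in a model, plus the version the model was
    // written with, to the word list that model should use.
    public static Language Resolve(int storedId, uint verWritten)
    {
        if (!Enum.IsDefined(typeof(Language), storedId))
            throw new FormatException("Unknown language id: " + storedId);

        var lang = (Language)storedId;

        // Models written before the correction keep the original
        // (retained) word list, even though the stored id is unchanged.
        if (lang == Language.Norwegian_Bokmal && verWritten == VerInitial)
            return Language.Norwegian_Bokmal_v1;

        return lang;
    }
}
```

Under these assumptions, a model written at 0x00010001 with the Bokmål id resolves to the retained v1 list, while the same id in a model written at 0x00010002 resolves to the corrected list, so no existing id has to be renumbered.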

}

public sealed class Column : OneToOneColumn
@@ -198,6 +201,11 @@ public ColInfoEx(ModelLoadContext ctx, ISchema input)
// int: the id of languages column name
Lang = (Language)ctx.Reader.ReadInt32();
Contracts.CheckDecode(Enum.IsDefined(typeof(Language), Lang));
+ if(Lang == Language.Norwegian_Bokmal
+ && ctx.Header.ModelVerWritten == 0x00010001)
+ {
+ Lang = Language.Norwegian_Bokmal_v1;
+ }
_langsColName = ctx.LoadStringOrNull();
if (_langsColName != null)
{
@@ -229,8 +237,8 @@ private static VersionInfo GetVersionInfo()
{
return new VersionInfo(
modelSignature: "STOPWRDR",
- verWrittenCur: 0x00010001, // Initial
- verReadableCur: 0x00010001,
+ verWrittenCur: 0x00010002, // Initial
Contributor

Instead of just

verWrittenCur: 0x00010002, // Initial

I would prefer something more along these lines.

//verWrittenCur: 0x00010001, // Initial
verWrittenCur: 0x00010002, // Corrected Norwegian Bokmål stopwords.

The idea is that as more versions are added, you have a little running catalog of why each version bump was necessary. The most extreme example of this is in our TextLoader, that is:

//verWrittenCur: 0x00010001, // Initial
//verWrittenCur: 0x00010002, // Added support for header
//verWrittenCur: 0x00010003, // Support for TypeCode
//verWrittenCur: 0x00010004, // Added _allowSparse
//verWrittenCur: 0x00010005, // Changed TypeCode to DataKind
//verWrittenCur: 0x00010006, // Removed weight column support
//verWrittenCur: 0x00010007, // Added key type support
//verWrittenCur: 0x00010008, // Added maxRows
//verWrittenCur: 0x00010009, // Introduced _flags
//verWrittenCur: 0x0001000A, // Added ForceVector in Range
verWrittenCur: 0x0001000B, // Header now retained if used and present

+ verReadableCur: 0x00010002,
verWeCanReadBack: 0x00010001,
loaderSignature: LoaderSignature);
}