A class to process and convert different kinds of texts/character encodings appearing in video games and elsewhere.
These identifiers tell certain functions how to process input and output strings.
- PRIMARY: Processes the operation according to the previously set primary encoding
- UTF8: UTF-8
- UTF16LE: UTF-16 in Little Endian
- UTF16BE: UTF-16 in Big Endian
- UTF32LE: UTF-32 in Little Endian
- UTF32BE: UTF-32 in Big Endian
- ASCII: ASCII
- ISO_8859_1: ISO-8859-1. Available aliases: LATIN1
- ISO_8859_2: ISO-8859-2. Available aliases: LATIN2
- ISO_8859_3: ISO-8859-3. Available aliases: LATIN3
- ISO_8859_4: ISO-8859-4. Available aliases: LATIN4
- ISO_8859_5: ISO-8859-5. Available aliases: CYRILLIC
- ISO_8859_6: ISO-8859-6. Available aliases: ARABIC
- ISO_8859_7: ISO-8859-7. Available aliases: GREEK
- ISO_8859_8: ISO-8859-8. Available aliases: HEBREW
- ISO_8859_9: ISO-8859-9. Available aliases: TURKISH, LATIN5
- ISO_8859_10: ISO-8859-10. Available aliases: NORDIC, LATIN6
- ISO_8859_11: ISO-8859-11. Available aliases: THAI
- ISO_8859_13: ISO-8859-13. Available aliases: BALTIC, LATIN7
- ISO_8859_14: ISO-8859-14. Available aliases: CELTIC, LATIN8
- ISO_8859_15: ISO-8859-15. Available aliases: WEST_EUROPEAN, LATIN9
- ISO_8859_16: ISO-8859-16. Available aliases: SOUTHEAST_EUROPEAN, LATIN10
- SHIFTJIS_CP932: Shift Jis Code Page 932. Available aliases: CP932, SHIFT_JIS_CP932, SJIS932, MS932
- JIS_X_0201_FULLWIDTH: JIS X 0201 in Full Width Katakana
- JIS_X_0201_HALFWIDTH: JIS X 0201 in Half Width Katakana
- KS_X_1001: KS X 1001. Available aliases: EUC_KR, KS_C_5601
- POKEMON_GEN1_ENGLISH: Pokémon Gen I English
- POKEMON_GEN1_FRENCH_GERMAN: Pokémon Gen I French & German
- POKEMON_GEN1_ITALIAN_SPANISH: Pokémon Gen I Italian & Spanish
- POKEMON_GEN1_JAPANESE: Pokémon Gen I Japanese
- POKEMON_GEN2_ENGLISH: Pokémon Gen II English
UTF16LE, UTF16BE relate to std::wstring, const wchar_t*, and wchar_t* types.
UTF32LE, UTF32BE relate to std::u32string, const char32_t*, and char32_t* types.
All others relate to std::string, const char*, and char* types.
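The mapping between encodings and string types comes down to the width of one code unit. The following is a minimal standalone sketch, not part of MorphText; the helper names `CodeUnitSize`, `OUmlautUtf8`, `OUmlautWide` and `OUmlautUtf32` are hypothetical and exist only for illustration. Note that `wchar_t` is 2 bytes on Windows but usually 4 bytes on Linux and macOS, so the sketch makes no claim about its size.

```cpp
#include <cstddef>
#include <string>

// Size in bytes of one code unit of the given string type.
template <typename StringT>
constexpr std::size_t CodeUnitSize()
{
    return sizeof(typename StringT::value_type);
}

// The letter U+00F6 ("o" with diaeresis) held in each string family.
inline std::string    OUmlautUtf8()  { return "\xC3\xB6"; }  // two 1-byte code units
inline std::wstring   OUmlautWide()  { return L"\u00F6"; }   // one wchar_t code unit
inline std::u32string OUmlautUtf32() { return U"\u00F6"; }   // one 4-byte code unit
```

For example, `OUmlautUtf8().size()` is 2 while `OUmlautUtf32().size()` is 1, which is why the encoding identifier and the string type have to agree.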
Creates an empty instance.
- str: input string of any supported type
- encoding: Encoding identifier of the input string
Creates a copy of another instance.
- other: source instance
The left-hand instance becomes a copy of the right-hand one.
Converts an input string of one encoding type to another.
- Template Parameters:
  - inT: The type of the input string.
  - outT: The type of the output string.
- Parameters:
  - input: The input string of type inT
  - inputEncoding: Character encoding identifier of the input string
  - outputEncoding: Character encoding identifier of the output string
- Returns: A string of type outT, encoded as outputEncoding.
std::wstring utf16le = MorphText::Convert<const char*, std::wstring>("an example", UTF8, UTF16LE);
Note: If you want to convert the assigned string of a MorphText instance, simply return it with the GetString() function.
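To give a feel for what a conversion between encodings involves, here is a minimal standalone sketch that decodes UTF-8 into UTF-32 code points. It is not MorphText's implementation; `DecodeUtf8` is a hypothetical helper, and it assumes mostly well-formed input (stray or truncated bytes become U+FFFD, overlong forms are not rejected).

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32 code points in native byte order.
// Illustration only: invalid bytes are replaced by U+FFFD.
std::u32string DecodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        // Sequence runs past the end of the input: give up.
        if (i + extra >= in.size()) { out.push_back(0xFFFD); break; }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

Calling `DecodeUtf8("\xF0\x9F\x91\x85")` yields the single code point U+1F445; a real converter would additionally validate the input and handle the requested byte order.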
Creates an all-lowercase copy of the input string.
- Data types:
  - inT: Any supported string type
- Parameters:
  - input: The input string of type inT
  - inputEncoding: Character encoding identifier of the input string
- Returns: A string of type inT, encoded as inputEncoding, in all lowercase
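As a rough picture of basic case mapping, the sketch below lowercases only the ASCII letters of a UTF-8 string. It is a standalone illustration, not MorphText's code; `ToLowerAscii` is a hypothetical name, and whether the library treats non-ASCII letters the same way is an assumption based on the to-do item about umlauts and full-width letters.

```cpp
#include <string>

// Lowercases only the ASCII letters A-Z and leaves every other byte
// untouched, so multi-byte UTF-8 sequences pass through unchanged.
std::string ToLowerAscii(std::string text)
{
    for (char& c : text)
    {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }
    return text;
}
```

With this approach `ToLowerAscii("Make Lowercase")` gives `"make lowercase"`, while an uppercase umlaut stays uppercase because its bytes lie outside the ASCII range.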
std::string utf8 = MorphText::ToLower("Make Lowercase", UTF8);
Creates an all-uppercase copy of the input string.
- Data types:
  - inT: Any supported string type
- Parameters:
  - input: The input string of type inT
  - inputEncoding: Character encoding identifier of the input string
- Returns: A string of type inT, encoded as inputEncoding, in all uppercase
std::string utf8 = MorphText::ToUpper("make uppercase", UTF8);
Creates a sarcastic copy of the input string.
- Data types:
  - inT: Any supported string type
- Parameters:
  - input: The input string of type inT
  - inputEncoding: Character encoding identifier of the input string
- Returns: A string of type inT, encoded as inputEncoding, with sarcastic energy
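The documentation does not spell out what "sarcastic" means. One common reading is alternating lower and upper case, and the standalone sketch below assumes exactly that. `AlternateCaseAscii` is a hypothetical helper, it only touches ASCII letters, and MorphText's actual pattern may differ.

```cpp
#include <string>

// Assumption: "sarcastic" text alternates lower and upper case,
// counting ASCII letters only. Other bytes are copied unchanged.
std::string AlternateCaseAscii(std::string text)
{
    bool upper = false;  // the first letter is made lowercase
    for (char& c : text)
    {
        const bool isLower = (c >= 'a' && c <= 'z');
        const bool isUpper = (c >= 'A' && c <= 'Z');
        if (!isLower && !isUpper)
            continue;  // spaces, digits, punctuation, non-ASCII bytes

        if (upper && isLower)
            c = static_cast<char>(c - 'a' + 'A');
        if (!upper && isUpper)
            c = static_cast<char>(c - 'A' + 'a');
        upper = !upper;
    }
    return text;
}
```

Under this assumption, `AlternateCaseAscii("you shouldn't")` produces `"yOu ShOuLdN't"`.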
std::string utf8 = MorphText::ToSarcasm("you shouldn't be using camelcase for your projects", UTF8);
Compares two strings for equality.
- Data types:
  - inT: Any supported string type
- Parameters:
  - lhs: Left-hand side string
  - rhs: Right-hand side string
  - caseSensitive: Whether to consider case sensitivity
  - encoding: Encoding identifier of the input strings
- Returns: true if both strings are identical, otherwise false.
Comparing C-style strings might be faster.
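The effect of the caseSensitive flag can be pictured with a small standalone sketch. `EqualsAscii` and `FoldAscii` are hypothetical helpers rather than MorphText functions, and the sketch assumes that only ASCII letters are folded when case is ignored.

```cpp
#include <cstddef>
#include <string>

// Maps A-Z to a-z and returns every other character unchanged.
inline char FoldAscii(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Returns true if both strings are identical; when caseSensitive is false,
// ASCII letters are compared without regard to case.
bool EqualsAscii(const std::string& lhs, const std::string& rhs, bool caseSensitive)
{
    if (lhs.size() != rhs.size())
        return false;

    for (std::size_t i = 0; i < lhs.size(); ++i)
    {
        const char a = caseSensitive ? lhs[i] : FoldAscii(lhs[i]);
        const char b = caseSensitive ? rhs[i] : FoldAscii(rhs[i]);
        if (a != b)
            return false;
    }
    return true;
}
```

Here `EqualsAscii("Test", "test", true)` is false, while the same call with `false` as the last argument is true.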
bool match = MorphText::Compare("test", "test", true, UTF8);
Compares the instance against another string for equality.
- Data types:
  - inT: Any supported string type
- Parameters:
  - rhs: Right-hand side string
  - caseSensitive: Whether to consider case sensitivity
  - encoding: Encoding identifier of the input strings
- Returns: true if both strings are identical, otherwise false.
Comparing C-style strings might be faster.
bool match = MorphText::Compare("Test", false, ASCII);
Finds the occurrence of a subset string within a superset string.
- Data types:
  - inT: Any supported string type
- Parameters:
  - superset: String that may contain the substring
  - subset: Substring that may appear within the superset string
  - caseSensitive: Whether to consider case sensitivity
  - encoding: Encoding identifier of the input strings
- Returns: The position of the subset within the superset. Returns -1 if the subset does not occur. If subset is empty, 0 is returned.
Finding C-style strings might be faster.
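The documented return values (position, -1 when absent, 0 for an empty subset) can be reproduced with a naive search. This is a standalone sketch under the assumption of ASCII-only case folding; `FindAscii` is a hypothetical name and not the library's own search routine.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first occurrence of subset within superset,
// -1 if there is none, and 0 if subset is empty.
int FindAscii(const std::string& superset, const std::string& subset, bool caseSensitive)
{
    if (subset.empty())
        return 0;
    if (subset.size() > superset.size())
        return -1;

    // Folds A-Z to a-z only when case is to be ignored.
    auto fold = [caseSensitive](char c) -> char {
        return (!caseSensitive && c >= 'A' && c <= 'Z')
                   ? static_cast<char>(c - 'A' + 'a')
                   : c;
    };

    for (std::size_t pos = 0; pos + subset.size() <= superset.size(); ++pos)
    {
        std::size_t k = 0;
        while (k < subset.size() && fold(superset[pos + k]) == fold(subset[k]))
            ++k;
        if (k == subset.size())
            return static_cast<int>(pos);
    }
    return -1;
}
```

For instance, `FindAscii("where banana?", "banana", true)` returns 6, and searching for `"BANANA"` case-sensitively returns -1.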
int pos = MorphText::Find("where banana?", "banana", true, ASCII);
Finds the occurrence of a subset string within the instance.
- Data types:
  - inT: Any supported string type
- Parameters:
  - subset: Substring that may appear within the instance
  - caseSensitive: Whether to consider case sensitivity
  - encoding: Encoding identifier of the input strings
- Returns: The position of the subset within the instance's string. Returns -1 if the subset does not occur. If subset is empty, 0 is returned.
Finding C-style strings might be faster.
int pos = MorphText::Find("banana", false, ASCII);
Returns the instance's string in the desired encoding.
- Template Parameter:
  - T: String type
- Parameter:
  - encoding: Encoding identifier of the output string
- Returns: The instance's string in the desired encoding and string type
MorphText test("ニコニコ二ー", UTF8);
std::string sjis = test.GetString<std::string>(SHIFTJIS_CP932);
Sets the instance's string using the given encoding identifier.
- Data type:
  - T: String type
- Parameter:
  - encoding: Encoding identifier of the input string
- Returns: The instance's string in the desired encoding and string type
MorphText test;
test.SetString("ニコニコ二ー", UTF8);
Sets the instance's string using the given encoding identifier.
- Parameter:
  - encoding: Encoding identifier of the input string
A test function that prints all class members. Only available in debug mode.
A test function that runs all functions. Only available in debug mode.
Required namespace: System.Runtime.InteropServices
The following shows how to define all conversion functions in your C# class:
//char* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToCharStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);
// char* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToWcharStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);
// char* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
private static extern IntPtr ConvertCharStringToWU32charStringUnsafe(byte[] input, int inputEncoding, int outputEncoding);
// wchar_t* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToCharStringUnsafe(char[] input, int inputEncoding, int outputEncoding);
// wchar_t* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToWcharStringUnsafe(char[] input, int inputEncoding, int outputEncoding);
// wchar_t* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
private static extern IntPtr ConvertWcharStringToU32charStringUnsafe(char[] input, int inputEncoding, int outputEncoding);
// char32_t* to char*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern IntPtr ConvertU32charStringToCharStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);
// char32_t* to wchar_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Auto)]
private static extern IntPtr ConvertU32charStringToWcharStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);
// char32_t* to char32_t*
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern IntPtr ConvertU32charStringToU32charStringUnsafe(UInt32[] input, int inputEncoding, int outputEncoding);
The following functions must be used to free the allocated memory of the converted strings (output strings):
//free C++ char*/C# byte[] string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryCharPtr(IntPtr ptr);
//free C++ wchar_t*/C# string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryWcharPtr(IntPtr ptr);
//free C++ char32_t*/C# UInt32[] string
[DllImport("MorphText.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void FreeMemoryU32charPtr(IntPtr ptr);
Usage examples:
//single-byte characters to C# string
Byte[] utf8 = new Byte[11] { 0x4d, 0x45, 0x4f, 0xc3, 0x96, 0xc3, 0x9c, 0xc3, 0x84, 0x57, 0x00 };
IntPtr resultPtr = ConvertCharStringToWcharStringUnsafe(utf8, 1 /*utf8*/, 2 /*utf16 little endian*/);
string utf16 = Marshal.PtrToStringUni(resultPtr);
FreeMemoryWcharPtr(resultPtr);
//C# string to C# string (the input is null-terminated)
char[] utf16BE = new char[4] { '\u4700', '\u5300', '\u3300', '\0' };
IntPtr resultPtr = ConvertWcharStringToWcharStringUnsafe(utf16BE, 3 /*utf16 big endian*/, 2 /*utf16 little endian*/);
string utf16 = Marshal.PtrToStringUni(resultPtr);
FreeMemoryWcharPtr(resultPtr);
//quadruple-byte characters to C# string
//5 assumes the encoding identifiers are numbered in the order listed above
UInt32[] utf32BE = new UInt32[4] { 0x30d00000, 0xdf000000, 0x45f40100, 0 };
IntPtr resultPtr = ConvertU32charStringToWcharStringUnsafe(utf32BE, 5 /*utf32 big endian*/, 2 /*utf16 little endian*/);
string utf16 = Marshal.PtrToStringUni(resultPtr);
FreeMemoryWcharPtr(resultPtr);
- check if double-byte characters of Shift-Jis are stored in LE on LE machines
- check if double-byte characters of KS X 1001 are stored in BE on BE machines and in LE on LE machines
- public static C-String type conversion specialization (convertToUTF8, convertFromUTF8)
- fix convertToUTF8(), convertFromUTF8(), Convert() to be able to use references of std::string, std::wstring, and std::u32string
- Pokémon character encodings (Gen II and later + spin-offs)
- add Shift-Jis CP10001/2000, Shift-Jis CP10001/2016
- improve ToLower, ToUpper, ToSarcasm functions by specializing them and considering characters like umlauts, full-width letters, etc
- improve comparisons by specializing them for each encoding
- make member comparison overloads for c-style input string work
- specialize findRaw() function for any other encoding than ASCII or any other UTF type to consider umlauts, fullwidth letters, etc for case insensitivity
- test on a big-endian system
- add necessary endianness checks to UTF-16 and UTF-32 operations
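The endianness items above could be approached roughly as follows. This is a standalone sketch, not code from MorphText; all function names are hypothetical, and it uses `char16_t` and `char32_t` rather than `wchar_t` so that the code unit widths are fixed.

```cpp
#include <cstdint>
#include <cstring>

// True when the host stores the least significant byte first.
inline bool IsLittleEndianHost()
{
    const std::uint16_t probe = 0x0001;
    unsigned char firstByte = 0;
    std::memcpy(&firstByte, &probe, 1);
    return firstByte == 0x01;
}

// Swaps the two bytes of a UTF-16 code unit.
constexpr char16_t SwapBytes16(char16_t unit)
{
    return static_cast<char16_t>(((unit & 0x00FF) << 8) | ((unit >> 8) & 0x00FF));
}

// Reverses the four bytes of a UTF-32 code unit.
constexpr char32_t SwapBytes32(char32_t unit)
{
    return static_cast<char32_t>(((unit & 0x000000FFu) << 24) |
                                 ((unit & 0x0000FF00u) << 8) |
                                 ((unit & 0x00FF0000u) >> 8) |
                                 ((unit & 0xFF000000u) >> 24));
}

// Interprets a code unit stored in big-endian order on any host.
inline char16_t FromBigEndian16(char16_t stored)
{
    return IsLittleEndianHost() ? SwapBytes16(stored) : stored;
}
```

As an example, `SwapBytes32(0x45f40100)` gives `0x0001f445`, which is how a byte-swapped UTF-32 value such as the one in the usage examples maps back to its code point.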
- Lawn Meower: Idea, Code
- sozysozbot: original KS X 1001 table
- Bulbapedia wiki at Bulbagarden: Documenting the Pokémon character encodings