Slug transliteration #194

Closed
tobyzerner opened this issue Jul 28, 2015 · 77 comments

Comments

@tobyzerner
Contributor

Ex: https://chanphom.com/forums/luat-choi-chan-pro.29/ from "Luật chơi Chắn Pro"

@dcsjapan
Contributor

Transliteration is possible for many languages, but very difficult or impossible for a few languages (like Japanese). It would be best if there were a way to enable/disable this function; or barring that, percent encoding of unicode might be preferable as a more universally applicable solution.

@tobyzerner
Contributor Author

Currently slugs are generated using only alphanumeric characters, replacing anything else with a hyphen. However we should support some degree of transliteration so non-Latin languages still get slugs. This is an area where I don't have much knowledge, and help would be appreciated.

What needs to be done:

  • Work out a transliteration strategy (i.e. a library, or is there anything in PHP's standard library?) that supports a wide range of alphabets.
  • Discuss the possibility of leaving unicode characters in slugs, for languages where transliteration is impossible. What are the problems with this, if any?
  • Depending on the strategy we decide upon, consider implementing a mechanism that allows language packs to turn transliteration on/off.
  • While we're here, we should also truncate long slugs to a maximum of 50 or so characters.
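The current behaviour and the proposed truncation can be sketched roughly as follows. This is a minimal illustration, not Flarum's actual code: the function name `basic_slug` and the default limit of 50 are assumptions taken from the list above.

```php
<?php
// Hypothetical helper: ASCII-only slugging plus a length cap.
function basic_slug(string $title, int $maxLength = 50): string
{
    $slug = strtolower($title);
    // Every run of characters outside a-z and 0-9 becomes a single hyphen.
    // Without the /u flag, each byte of a non-ASCII character counts as "other".
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    $slug = trim($slug, '-');

    if (strlen($slug) > $maxLength) {
        // Cut at the limit, then drop a hyphen left dangling by the cut.
        $slug = rtrim(substr($slug, 0, $maxLength), '-');
    }

    return $slug;
}

echo basic_slug('Hello, World!'), "\n";     // hello-world
echo basic_slug("espa\u{00F1}ol"), "\n";    // espa-ol
```

The second call shows the problem this issue is about: the non-ASCII letter is simply lost.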

@wielski

wielski commented Sep 10, 2015

Maybe you can use library like this one?
https://github.com/ashtokalo/php-translit

@Buhito72

In Spanish, the mod_rewrite replaces characters like ñ and accented vowels with a hyphen. To improve SEO, it would be better to substitute the equivalent unaccented characters, for example: español ---> espanol (instead of espa-ol), corazón ---> corazon (instead of coraz-n). This can be done with a simple character replacement.

… ]/', '/[-]+/', '/<[^>]*>/');
$repl = array('', '-', '');
$url = preg_replace($find, $repl, $url);
return $url;
}
?>

@ISilvaPT

ISilvaPT commented Oct 4, 2015

Same could be said for Portuguese:
ã | â | á | à > a
ê | é | è > e
í | ì > i
õ | ô | ó | ò > o
ú | ù > u
ç > c
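A table like this maps directly onto PHP's `strtr()`. The sketch below is only an illustration: the constant and function names are made up, and the table covers just the lowercase letters listed above.

```php
<?php
// Hypothetical, incomplete mapping: lowercase Portuguese letters only.
const PT_ACCENT_MAP = [
    'ã' => 'a', 'â' => 'a', 'á' => 'a', 'à' => 'a',
    'ê' => 'e', 'é' => 'e', 'è' => 'e',
    'í' => 'i', 'ì' => 'i',
    'õ' => 'o', 'ô' => 'o', 'ó' => 'o', 'ò' => 'o',
    'ú' => 'u', 'ù' => 'u',
    'ç' => 'c',
];

function fold_accents(string $text): string
{
    // strtr() with an array replaces every key with its value in one pass.
    return strtr($text, PT_ACCENT_MAP);
}

echo fold_accents('coração'), "\n";    // coracao
echo fold_accents('português'), "\n";  // portugues
```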

@dcsjapan
Contributor

dcsjapan commented Oct 8, 2015

As I mentioned above and in #557, transliteration isn't a complete solution. There are some languages that can't be transliterated very easily, or at all.

In the case of Japanese, as I mentioned in Stumbling block 6, it would take a lot of rather sophisticated processing to come up with reliable transliterations of words spelled using Chinese characters. And even the most sophisticated program will be reduced to guessing when it comes to things like names, which can use Chinese characters in nonstandard ways.

Japanese is clearly an extreme case, but even where the relationship between pronunciation and spelling tends to be more stable, there are still difficulties. To transliterate Chinese reliably, for example, you would need to provide a glossary of at least several thousand characters. So it's not always a matter of applying a few well-defined rules.

In regions where transliteration is impractical, there is a strong trend toward the use of unicode in URLs. Flarum will have to support that, or it will simply be irrelevant in those regions. At the same time, however, Flarum also needs to offer transliteration for regions that have adopted that approach.

My suggestion is:

Admins should be allowed to specify whether URLs should be transliterated or encoded. This could be implemented as an administrator setting, though it might be better still to have the question asked and answered during the installation process.

When an admin chooses the former, a library such as this one suggested by @FirestarterUA could be used to transliterate all slugs, including thread titles, tag names, and usernames. (Flarum may need to check all these items and return an error whenever any non-transliteratable text is entered. Or we could leave it up to admins to tell their users: "Don't use any Chinese characters ... or else!")

When an admin chooses the latter, all URLs are encoded appropriately, with only an absolute minimum of character replacement (e.g. hyphens in place of spaces) being performed.

@johannsa
Contributor

Why not use the same approach as Wikipedia and allow Unicode in slugs, which is supported by modern browsers and also by part of Flarum's frontend? This way many character sets would be available.

Also, slugs for discussions are currently generated on the client, which is not ideal. They should be generated on the server (and stored in the database, like tag slugs are).

@dcsjapan
Contributor

Why not use the same approach as Wikipedia and allow Unicode in slugs, which is supported by modern browsers and also by part of Flarum's frontend?

I think that would be a great solution ... I'd just like to be sure there aren't any SEO implications for admins in regions where transliteration is the accepted approach.

@franzliedke
Contributor

As discussed in #646, we can use Stringy which gives us slugging functionality for free.

@franzliedke
Contributor

We might also want to truncate the slug after a certain length.

@thecotne

I want to mention that for Georgian, slugs are not generated at all (from "რა კაი ფორუმი წამოვჭიმეთ!" I got the slug "--").
Also, the Wikipedia approach is the best one for slugs.

@akalongman

+1
@tobscure We need unicode slugs

@dsevillamartin
Member

This looks good for different languages: Cocur/Slugify. The only problem is that it needs the language name fully spelled out (english instead of en), although that is probably an easy fix.
The other one I found, which doesn't need a language, is Jbroadway/urlify, although I think that one is more basic.
Whichever is better ;)

@dcsjapan
Contributor

dcsjapan commented Apr 4, 2016

Of the transliteration options mentioned, Slugify strikes me as the most worthy of consideration. It covers a wide range of languages out of the box, can easily be customized to cover more, and is flexible when it comes to integration.

As @franzliedke said, Stringy may also be an option, especially if it can also be employed for tasks other than transliteration. One cause for concern is that it only does slugification, not true transliteration; that is, it seems to work on a fixed ruleset:

Converts the string into an URL slug. This includes replacing non-ASCII characters with their closest ASCII equivalents, removing remaining non-ASCII and non-alphanumeric characters, and replacing whitespace with $replacement.

This may not provide the best transliterations for all languages; converting ä to a would not work in a language where ae is the more commonly used transliteration. A more language-specific solution would give better results vis-a-vis both SEFiness and human readability.

I'm wondering whether it would be possible to use Stringy, but insert language-specific rulesets (like the ones used by Slugify) when available. We could put the ruleset file right in the language pack, as we've done with Moment.js translations. When the admin sets the forum's slugification style to "transliteration" (as opposed to "UTF-8") Flarum would grab the ruleset for the forum's default language and slugify based on that. If the language pack is lacking a ruleset, it could fall back to standard Stringy slugification.

Would something like this be possible?

EDIT: It would be best to have Stringy treat the language-specific ruleset as overrides, so it can default to its own slugification rules when it encounters a character that's not covered in the ruleset being used. That would allow it to cope with situations involving characters not included in the ruleset for the default language ... such as a topic about Søren Kierkegaard in a French forum.

This solution would be best suited to single-language forums. Handling of thread titles (etc.) in more than one language would tend to be hit-and-miss. And in cases where a forum includes languages requiring different slugification methods ... Russian and Japanese, for example ... the admin will be forced to use UTF-8 slugs. The only way around that would be to make Flarum truly multilingual, i.e. assign a locale value to each thread.
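The override idea from the EDIT above could look something like this. It is a sketch under assumptions: both rulesets are tiny made-up samples, and neither the function name nor the array format comes from Stringy, Slugify, or Flarum.

```php
<?php
// Hypothetical rulesets; a real language pack would ship a much larger one.
$defaultRules = ['ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ø' => 'o', 'é' => 'e'];
$germanRules  = ['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss'];

function transliterate_with_overrides(string $text, array $languageRules, array $defaultRules): string
{
    // The + operator keeps the left-hand value for duplicate keys, so the
    // language-specific rules win and the defaults cover everything else.
    return strtr($text, $languageRules + $defaultRules);
}

echo transliterate_with_overrides('Søren mag Käse', $germanRules, $defaultRules), "\n";
// Soren mag Kaese
```

Here ä follows the German rule (ae), while ø, which the German ruleset does not cover, falls back to the default.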

@franzliedke franzliedke modified the milestone: 0.1.0 Apr 7, 2016
@yihui

yihui commented Apr 29, 2016

As a Chinese speaker, I'd just want a simple option to disable slugs for posts. I don't want either transliteration or Unicode characters in the URLs. Personally, I also prefer shorter URLs like example.com/d/12345 instead of example.com/d/12345-hello-world. Having Unicode Chinese characters in the URL will make it horribly long and messy, like https://zh.wikipedia.org/wiki/Portal:%E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B, when you copy the URL from the address bar of the browser (e.g. Chrome). That is not human readable, so such slugs will be useless. I think disabling slugs is much easier to implement and more useful to Chinese users.

@dcsjapan
Contributor

dcsjapan commented Apr 29, 2016

Safari and Firefox are able to copy the URL in human-readable format. When I open the URL you linked above and copy it from the Safari address bar, I get this:

https://zh.wikipedia.org/wiki/Portal:新聞動態

So this should probably be considered a deficiency of Chrome ... or of your OS, perhaps. That said, a third option to disable slugs altogether shouldn't be too hard to implement, and may be wanted by enough site admins that it would be worth adding.
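For reference, the two forms of that URL are the same string under percent-encoding, which PHP exposes as `rawurlencode()` and `rawurldecode()`. A small sketch:

```php
<?php
$title = '新聞動態';

// Each UTF-8 byte of the title becomes one %XX escape.
$encoded = rawurlencode($title);
echo $encoded, "\n";                 // %E6%96%B0%E8%81%9E%E5%8B%95%E6%85%8B

// Decoding restores the human-readable form.
echo rawurldecode($encoded), "\n";   // 新聞動態
```

Four characters turn into 36, which is where the "horribly long" copied URL comes from.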

@believer-ufa

believer-ufa commented May 15, 2016

Hello guys :) Have you heard of PHP's Intl Transliterator extension?

For example, you can use this snippet to transliterate any string to Latin characters (even Japanese characters, as far as I know):

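<?php
// Requires the intl extension. The rule chain converts any script to Latin,
// folds accented Latin letters to ASCII, then removes any remaining non-ASCII.
$rules = 'Any-Latin; Latin-ASCII; [\u0080-\uffff] remove';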

echo transliterator_transliterate($rules,'Какая-то строка, которая нуждается в транслитерации');
// Kakaa-to stroka, kotoraa nuzdaetsa v transliteracii

echo transliterator_transliterate($rules,'新聞動態');
// xin wen dong tai

echo transliterator_transliterate($rules,'რა კაი ფორუმი წამოვჭიმეთ');
// ra kai porumi tsamovchimet

You can find more info about these transliterator functions in the source of the Yii 2 framework, for example.

@believer-ufa

believer-ufa commented May 15, 2016

Also, on the documentation page for the Intl extension there is a comment from one of the PHP developers describing one possible way to turn a string into a properly transliterated URL slug:

<?php
function slugify($string) {
    $string = transliterator_transliterate("Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove; Lower();", $string);
    $string = preg_replace('/[-\s]+/', '-', $string);
    return trim($string, '-');
}

echo slugify("Я люблю PHP!"); // a-lublu-php
echo slugify('რა კაი ფორუმი წამოვჭიმეთ'); // ra-kʼai-porumi-tsʼamovchʼimet
echo slugify('新聞動態'); // xin-wen-dong-tai
?>

I think both methods need to be tested on a range of strings to choose the more correct one :)

@franzliedke
Contributor

franzliedke commented May 14, 2017

Hmm, the URL without slug is already understood: https://discuss.flarum.org/d/187.

That means that only the URL generation code has to be adapted.

@Zeokat
Contributor

Zeokat commented May 15, 2017

@franzliedke Yes, URLs with only the discussion ID are already understood, which also gives us some duplicate content, because both URLs return HTTP status 200 without any redirect (301) that search engines can understand. Anyway, that's another story.

I'm speaking about the lines of code that add the dash after discussion-id slug (for example, on empty slugs the autogenerated slug is https://discuss.flarum.org/d/5772-- , which seems a little ugly).

Anyway here we go: #1183
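The URL-generation change being discussed amounts to leaving out the dash when there is nothing to put after it. A minimal sketch, with a hypothetical function name that is not Flarum's actual route generator:

```php
<?php
// Hypothetical helper: builds a discussion path from an ID and a slug.
function discussion_path(int $id, string $slug): string
{
    // A slug made only of hyphens (e.g. "--") carries no information.
    $slug = trim($slug, '-');

    return $slug === '' ? "/d/{$id}" : "/d/{$id}-{$slug}";
}

echo discussion_path(5772, '--'), "\n";            // /d/5772
echo discussion_path(12345, 'hello-world'), "\n";  // /d/12345-hello-world
```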

@renato

renato commented Dec 10, 2018

Is this still planned to land on core or should we use extensions (there are 2 iirc) to solve this?

@franzliedke
Contributor

Still planned, the ticket is still open. 😉

@renato

renato commented Dec 11, 2018

I know it's open, but it's tagged as "needs-discussion". Does it still need discussion, even after #1385 (unfortunately abandoned)? Or could we use the same approach used there (Illuminate\Support\Str::slug)?

@franzliedke
Contributor

Feel free to send a new PR that takes the changes from #1385 and applies them to the current code. The original author unfortunately stopped responding.

@franzliedke
Contributor

Another option: https://github.com/sunrise-php/slugger

@ivangretsky

ivangretsky commented Mar 29, 2019

While this feature is still being discussed, just wanted to mention, that there is a working extension for transliteration for beta 8.1 supported by Friends of Flarum. It is actually a fork of this one. Thanks, people!

@luceos
Member

luceos commented Oct 15, 2019

@Zeokat
Contributor

Zeokat commented Oct 15, 2019

But it requires the intl extension, if I'm not wrong 😑

@clarkwinkelmann
Member

Suggestion: put slug generator in the container with an interface for easier extension.

That way extensions like FoF Transliterator can extend it instead of listening for an event and overriding the value.

This way if another part of core uses slugs server side, it will also use the same logic.

Sadly tags is another issue because they use the client side slug() method in utils/string.

franzliedke added a commit that referenced this issue Jan 24, 2020
This is better than the current system, as it adds transliteration rules
for special characters, rather than just throwing all of them away.

For languages that cannot be transliterated to ASCII in a reasonable
manner, more possible improvements are outlined in #194.
@franzliedke
Contributor

Okay, for anyone interested, we're finally making some progress here:

If #1975 is merged, we will have a basic transliteration implementation, based on what Laravel brings along. As discussed in detail in this issue, this is great for some, but not helpful for other languages, so more work still needs to be done.

As it doesn't make the current situation (auto-generating slugs) worse for anybody, as far as I can tell, but is an improvement for languages where transliteration makes sense (e.g. German), I think this is a solid improvement that we can make without tackling all the other things that could be done.

That doesn't mean we want to stop here, though. Improving how we cater to international audiences is very important to us (and also very eye-opening).


Based on my understanding of everything that was said in this issue, here is what I would propose as next steps / challenges. Once we agree on these, I would suggest to create separate subtickets that can be scheduled for different releases:

  • Support for different slugging strategies, configurable via the admin panel - initially support the current strategy (internationalized transliteration) and the null strategy (no slugs at all)
  • Enforce (and configure?) maximum length of slugs
  • Allow language packs / extensions to provide custom strategies which admins can select
  • An additional option for keeping Unicode characters in the slugs (research + admin setting)

If we manage to make some progress on each of these, I would be very content. 😅

Of course, one could go above and beyond with support for language-specific slugging based on the (auto-detected?) language of a discussion, but I would say that's too much for core and clearly extension territory.
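To make the idea of selectable strategies concrete, here is a rough sketch. Everything in it is hypothetical: the interface, the class names, and the setting values are illustrations, not Flarum's actual API.

```php
<?php
// All names here are hypothetical; this is not Flarum's actual API.
interface SlugStrategy
{
    public function slug(string $title): string;
}

// The "null strategy": no slug at all, so URLs stay as /d/12345.
final class NullSlugStrategy implements SlugStrategy
{
    public function slug(string $title): string
    {
        return '';
    }
}

// Keeps letters, combining marks and digits of any script; every other
// run of characters (spaces, punctuation) becomes a single hyphen.
final class UnicodeSlugStrategy implements SlugStrategy
{
    public function slug(string $title): string
    {
        // preg_replace() returns null on invalid UTF-8; fall back to no slug.
        $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $title) ?? '';

        return trim($slug, '-');
    }
}

// Maps an admin setting (assumed here to be a plain string) to a strategy.
function make_slug_strategy(string $setting): SlugStrategy
{
    switch ($setting) {
        case 'unicode':
            return new UnicodeSlugStrategy();
        default:
            return new NullSlugStrategy();
    }
}

echo make_slug_strategy('unicode')->slug('რა კაი ფორუმი წამოვჭიმეთ!'), "\n";
// რა-კაი-ფორუმი-წამოვჭიმეთ
```

A transliterating strategy, or one supplied by a language pack or extension, would be one more class behind the same interface.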

@Zeokat
Contributor

Zeokat commented Jan 24, 2020

I have been waiting for this for years, and @franzliedke's plan seems good to me. I haven't tested Laravel's slugger, but if it does its job well, it will be at least one step forward 👍

What I want to add here is the question of what will happen with tag and username slugs, because they may or may not need to be transliterated as well.

luceos pushed a commit that referenced this issue Feb 4, 2020
@luceos luceos added this to the 0.1 milestone Feb 14, 2020
@askvortsov1
Member

Closing this as solved by flarum/issue-archive#203, extensions can now introduce custom slug drivers to allow any approach imaginable.

If a custom attribute is used to store a new transliterated / modified slug, the Saving event can be used to set / update that attribute.
