Arabic example for word_cloud #315

Closed
wants to merge 5 commits into from


caleighm
Contributor

Here's a little Arabic example - I noticed a couple issues that folks had with Arabic, and in one of them you responded to say it would be helpful to have an example in the documentation. So, here's one!

@amueller
Owner

Awesome, thanks! Can you please post the image here, too?
It looks like you used Google Translate, which is a nice idea. But maybe it would be better to use a Wikipedia entry (on some uncontroversial subject, say, Arabic itself).

@amueller
Owner

Also, it would be good if someone who actually speaks Arabic could look over it. From #303 it seems there were some problems with the direction of the text, and from #70 it seems there are problems with tokenization (see the comment in #70).

It's not clear to me whether your example addresses these issues. Also, does it work for Python 2 and Python 3?

@caleighm
Contributor Author

caleighm commented Oct 27, 2017

arabic_reshaper handles displaying contextual glyphs (vs. individual letters) and python-bidi (imported as bidi.algorithm) handles the direction. The blog post linked in #70 (http://mpcabd.xyz/python-arabic-text-reshaper/) explains it pretty well, although it seems to be down right now. The arabic_reshaper repo is here: https://github.com/mpcabd/python-arabic-reshaper
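
To make those two problems a bit more concrete, here is a rough standard-library-only sketch (not part of this PR; the sample letters and the helper name bidi_classes are just for illustration). It shows that an Arabic letter has separate positional presentation forms, which is the kind of substitution a reshaper performs, and that Arabic letters carry a right-to-left bidi class, which is what the bidi algorithm acts on.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, as stored in ordinary text

# Presentation forms of the same letter: isolated, final, initial, medial.
# A reshaper swaps the base letter for one of these depending on its neighbours.
for form in ("\ufe8f", "\ufe90", "\ufe91", "\ufe92"):
    print(f"U+{ord(form):04X}", unicodedata.name(form))

# Compatibility normalization folds a presentation form back to the base letter.
assert unicodedata.normalize("NFKC", "\ufe91") == BEH


def bidi_classes(text):
    """Return the Unicode bidi class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# 'L' marks left-to-right letters, 'AL' marks right-to-left Arabic letters.
print(bidi_classes("ab"))
print(bidi_classes("\u0628\u0627"))
```

This only inspects the Unicode data; the actual reshaping and reordering are still done by arabic_reshaper and python-bidi.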

My teammates who speak Arabic have said this method works but you're right, I can't tell myself! I'm not even confident enough to grab the Arabic wikipedia entry without bungling up the words somehow, unless you just meant to copy/paste the intro.

I've added the resulting image, but it's from the original Google Translate text - not the Wikipedia page right now.

I'll ask one of my teammates to format that text and then I can upload that and re-do the image.

@amueller
Owner

That would be really great, thank you. The image is pretty low resolution right now. Maybe add a link to the reshaper website in a comment in the code?

@caleighm
Contributor Author

caleighm commented Oct 27, 2017

Yup, will do. I believe the low-resolution look comes from the font I'm using. Unifont is a bitmap font that covers the whole Unicode Basic Multilingual Plane (which is A LOT of glyphs!), so it's pretty basic-looking. I could have used an Arabic-specific font, which looks a lot nicer, but the nice thing about Unifont is that nearly any Unicode text (no matter the language) should render in wordcloud. Things like directionality and contextual shaping, as in Arabic, still have to be handled separately, but that's what this Arabic example is for!

https://en.wikipedia.org/wiki/GNU_Unifont

@amueller
Owner

how about using gnu freefont? http://ftp.gnu.org/gnu/freefont/ https://en.wikipedia.org/wiki/GNU_FreeFont
That might actually be a better default font than the Droid I have now...

@caleighm
Contributor Author

I'll take a look at the font tomorrow - would be great to have a more beautiful but very thorough unicode font available. I am new to working with Arabic so I'm still learning a lot about encoding!

I'll also get that Wikipedia article in Arabic to update the examples.

@bakrianoo

@caleighm

Hi

I just solved this problem by following these steps:

1 - In the library's site-packages folder, I replaced the "DroidSansMono.ttf" font file with a new one, keeping the same file name.
The problem is that the bundled font file does not support Arabic. For example, I used this font:
https://brushez.com/sc-ameen-2.html

Or the popular "Arial" font:
http://www5.miele.nl/apps/vg/nl/miele/mielea02.nsf/0e87ea0c369c2704c12568ac005c1831/07583f73269e053ac1257274003344e0?OpenDocument

2 - I used both of these libraries to solve the Arabic character direction issue:

  • python-bidi
  • arabic_reshaper

Then I applied this code:

from bidi.algorithm import get_display
import matplotlib.pyplot as plt
import arabic_reshaper
from wordcloud import WordCloud

text = u"انا احب اللغة العربية و حروفها I love English words"
reshaped_text = arabic_reshaper.reshape(text)
artext = get_display(reshaped_text)

wordcloud = WordCloud().generate(artext)
wordcloud.to_image()

Result

@caleighm
Contributor Author

caleighm commented Nov 3, 2017

I used the Arial font like you said @bakrianoo! It looks great. I'm just not sure what the copyright situation is, so I've kept Unifont in the examples/font folder since we're allowed to distribute that one for sure.

Also updated the text sample so it's the Wikipedia entry as requested.

@amueller
Owner

amueller commented Nov 4, 2017

I'm pretty sure Arial is copyrighted (and possibly covered by other intellectual property protection) by Microsoft, so using Unifont is a better idea. Thanks @caleighm, I'll check it out in a bit.

@amueller
Owner

amueller commented Nov 4, 2017

Have you tried this font? https://en.wikipedia.org/wiki/GNU_FreeFont

@bakrianoo

bakrianoo commented Nov 4, 2017

@amueller
I already tried GNU FreeFont, but the result was full of missing characters.

Result

By the way, does GitHub send notifications for replies here? Or how can I follow the replies without digging through my browser history every time?

@caleighm
Contributor Author

caleighm commented Nov 4, 2017

@amueller @bakrianoo I've similarly had no success with GNU Freefont - seems to be missing some characters.

What I've done is provided a word cloud example that uses Arial (so it's more beautiful), but in my sample code I left the font selection to be Unifont (which is provided in the fonts folder). However, I put in some comments saying that the provided word cloud image was made with Arial, it's just that we are unable to distribute that font in this repo.

My understanding is that we can use Arial for personal use but obviously can't distribute it to others. So creating a word cloud image and sharing that image with others, with the image using Arial, should be okay, but actively sharing Arial font is not ok.

@amueller
Owner

amueller commented Nov 6, 2017

@bakrianoo I'm not sure I understand your question but on the right hand side you can "subscribe" to the thread for email notifications.

Example using Arabic
===============
Generating a wordcloud from Arabic text
Other dependencies: bidi.algorithm, arabic_reshaper
Owner

maybe say how to install those. you need to pip install python-bidi, which is not entirely obvious.

I tried this code with a large amount of Arabic data, but I get this error:

 '%s not allowed here' % _ch['type']

AssertionError: RLI not allowed here

Can you resolve it, please?

@Abdulrahman44
Try removing RLI from the string
code.docx

@Lotemn102
How? Can you explain more, please?

By replacing every RLI in your text with an empty string.
Have you seen the code.docx file I added to my previous comment? (I can't copy the RLI character into this box.)

This problem actually has nothing to do with this repository, so if you have more questions about it, please contact me at lotemn102@gmail.com or try asking on Stack Overflow.
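
The suggestion above (replacing every RLI with an empty string) can be sketched roughly as follows. This assumes "RLI" means the invisible Unicode character U+2067 (RIGHT-TO-LEFT ISOLATE) and that dropping it, together with the related isolate controls U+2066, U+2068 and U+2069, is acceptable for your data. The helper name strip_bidi_isolates is made up for this example.

```python
# Invisible bidi isolate control characters, mapped to None so that
# str.translate deletes them.
BIDI_ISOLATES = {
    0x2066: None,  # LEFT-TO-RIGHT ISOLATE (LRI)
    0x2067: None,  # RIGHT-TO-LEFT ISOLATE (RLI)
    0x2068: None,  # FIRST STRONG ISOLATE (FSI)
    0x2069: None,  # POP DIRECTIONAL ISOLATE (PDI)
}


def strip_bidi_isolates(text):
    """Return text with the bidi isolate control characters removed."""
    return text.translate(BIDI_ISOLATES)


# Sample string with an RLI/PDI pair wrapped around an Arabic word.
raw = "\u2067مرحبا\u2069 hello"
clean = strip_bidi_isolates(raw)
print(len(raw), len(clean))
```

Run something like this on the text before passing it to arabic_reshaper and get_display.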

OK, thanks.

@amueller
Owner

amueller commented Nov 6, 2017

ok so my suggestion in terms of fonts would be to merge several of the noto fonts. Maybe latin, arabic and chinese for now?

@amueller
Owner

amueller commented Nov 6, 2017

download

This is using Noto Sans Regular, merged with Noto Sans Arabic. Does that look ok? Would you prefer a serif for Arabic? It looks like merging in Chinese is not really an option because it's too big.

@amueller
Owner

amueller commented Nov 6, 2017

@amueller
Owner

amueller commented Nov 6, 2017

If you look for arabic here: https://www.google.com/get/noto/
you can see sans, naskh, kufi and more. I don't know which one would be most appropriate.

@amueller
Owner

amueller commented Nov 6, 2017

@amueller
Owner

amueller commented Nov 6, 2017

Other possibilities: https://www.typotheque.com/fonts/arabic/latin

@caleighm
Contributor Author

Sorry I've been MIA, was at a conference - will look more into this tomorrow night:

  1. Find a nicer font from your above suggestions
  2. Add pointers for installing python-bidi and arabic_reshaper. I think we should just link to those repos rather than put installation instructions for other packages in word_cloud: if their install instructions change, we'd have to update our example too, whereas people who follow a link to the source will always get the most up-to-date instructions.

@AMR-KELEG
Contributor

AMR-KELEG commented Sep 16, 2019

Is it possible to make word_cloud support unicode characters by default?

@amueller
Owner

@AMR-KELEG what do you mean by that? It does support unicode characters by default, but you need to specify a font that includes them. We could provide a warning message if characters are not included but I'm not aware of a way to ship a font for all of unicode. The only one I'm aware of is noto, which would be many GB and not feasible to include in the package.

@AMR-KELEG
Contributor

@amueller Yes, I meant adding a font to the repository.
A warning message for sure is nice to have.
I also think adding a section to the README file mentioning how to add fonts (with some links to the fonts) for different languages would be great.

@amueller
Owner

@AMR-KELEG PR to the readme welcome. I'd also welcome a PR that adds a warning message if the font doesn't support some characters. Ideally the warning would point towards the readme/the docs with instructions to installing additional fonts.
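
One way such a warning could work, as a sketch under assumptions: the set of code points the font supports is already available (with fontTools, TTFont(path).getBestCmap() returns a mapping whose keys are code points, but that would be an extra dependency and is not something word_cloud does today). The function names below are hypothetical.

```python
import warnings


def missing_codepoints(text, supported):
    """Return the sorted non-whitespace characters of text that are not in supported.

    supported is a set of integer code points the font can render.
    """
    return sorted(
        {ch for ch in text if not ch.isspace() and ord(ch) not in supported}
    )


def warn_if_unsupported(text, supported, font_name="the selected font"):
    """Warn when text contains characters the font cannot render; return them."""
    missing = missing_codepoints(text, supported)
    if missing:
        sample = ", ".join(f"U+{ord(ch):04X}" for ch in missing[:5])
        warnings.warn(
            f"{len(missing)} character(s) are not covered by {font_name} "
            f"(e.g. {sample}); pass font_path= a font that includes them."
        )
    return missing


# Pretend font that only covers printable ASCII.
ascii_only = set(range(0x20, 0x7F))
print(warn_if_unsupported("love \u062d\u0628", ascii_only))
```

The warning text could then point at the README section on installing additional fonts.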

@amueller
Owner

I saw that there's been some comments, I think having an example in arabic would actually be great. I don't remember entirely what the status of this PR is. I'd be happy to merge it in the current state and then maybe we can iterate from there? The examples will probably not work because Circle doesn't have the arabic reshaper installed. It might be nice to fix that and then merge?

@iamaziz

iamaziz commented May 15, 2020

Hi everyone, thanks for the good discussion!
FWIW, here is a complete example:

from collections import Counter

from wordcloud import WordCloud          # pip install wordcloud
import matplotlib.pyplot as plt          
# -- Arabic text dependencies
from arabic_reshaper import reshape      # pip install arabic-reshaper
from bidi.algorithm import get_display   # pip install python-bidi

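def rtl(word):
    """Reshape an Arabic word and reorder it for left-to-right rendering."""
    return get_display(reshape(str(word)))

# Count the raw words first, then convert only the keys for display.
COUNTS = Counter("السلام عليكم ورحمة الله و بركاته السلام كلمة جميلة".split())
counts = {rtl(k): v for k, v in COUNTS.most_common(10)}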

font_file = './NotoNaskhArabic-Regular.ttf' # download from: https://www.google.com/get/noto
wordcloud = WordCloud(font_path=font_file).generate_from_frequencies(counts)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

The result:

image

@amueller
Owner

@iamaziz is that much different from what's in this PR? Sorry I lost track, this should really go into the examples.
If you want to pick up the PR and make sure it runs on CI that would be great!

@iamaziz

iamaziz commented May 15, 2020

@amueller sure, I can give it a try. It's gonna add two more dependencies arabic_reshaper and python-bidi to the requirements.txt if that's cool? And the font file as well.

@amueller
Owner

Please only add them to the CI installation files, not to the main requirements.txt

@iamaziz

iamaziz commented May 15, 2020

Sure, is the CI installation under requirements-dev.txt?

@AMR-KELEG
Contributor

AMR-KELEG commented May 15, 2020

I believe you should modify this line to install the dependencies for testing: https://github.com/amueller/word_cloud/blob/master/.circleci/config.yml#L22
https://github.com/amueller/word_cloud/blob/master/scikit-ci.yml

I am not sure how each of these files is used.

@iamaziz

iamaziz commented May 15, 2020

Ah cool, thanks! Yeah, I was not sure either. I was poking around in .travis.yml and thought I would add something like pip install python-bidi arabic_reshaper somewhere under this line:

https://github.com/amueller/word_cloud/blob/master/.travis.yml#L49

@iamaziz

iamaziz commented May 17, 2020

Just added a tiny wrapper for this example here. It's prob hacky 😅 but it works. For now, it's like

$ pip install ar_wordcloud
from ar_wordcloud import ArabicWordCloud
awc = ArabicWordCloud()
awc.from_text(...)

@amueller
Owner

The relevant part in the circle config is here:
https://github.com/amueller/word_cloud/blob/master/.circleci/config.yml#L81

Which calls this script:
https://github.com/amueller/word_cloud/blob/master/doc/build-website.sh

But basically you just need to add the dependencies here:
https://github.com/amueller/word_cloud/blob/master/doc/requirements-doc.txt

@AMR-KELEG
Contributor

Oh, that's nice. Will the example be added automatically to the site after building it?

On the other hand, I believe that the current commits need some tiny modifications/bug fixes.
Do you know how the commits can be modified and applied to this PR?
I can't think of a way to do so.
Should we just create a new Pull Request, merge it into master and then close this Pull Request?
This solution won't give credit to the authors of this Pull Request.
What do you think?

@amueller
Owner

@AMR-KELEG if you add an example to the examples folder and the filename starts with plot_ it will automatically be added to the website by the sphinx-gallery plugin.

And you can just create a new PR indeed. You can put the original commits into your new PR to give credit.

@amueller amueller deleted the branch amueller:master May 8, 2023 17:00
@amueller amueller closed this May 8, 2023