Arabic example for word_cloud #315
Awesome, thanks! Can you please post the image here, too?
Also, it would be good if someone who actually speaks Arabic could look over it. From #303 it seems there were some problems with the direction of the text, and from #70 it seems there are problems with tokenization; see #70 (comment). It's not clear to me whether your example addresses these issues. Also, does it work on both Python 2 and Python 3?
arabic_reshaper handles displaying contextual glyphs (vs individual letters) and bidi.algorithm handles the direction. The blog linked to in #70 (http://mpcabd.xyz/python-arabic-text-reshaper/, although it seems to be down right now; the arabic_reshaper repo is here: https://github.com/mpcabd/python-arabic-reshaper) explains it pretty well! My teammates who speak Arabic have said this method works but you're right, I can't tell myself! I'm not even confident enough to grab the Arabic Wikipedia entry without bungling up the words somehow, unless you just meant to copy/paste the intro. I've added the resulting image, but it's from the original Google Translate text, not the Wikipedia page, right now. I'll ask one of my teammates to format that text and then I can upload that and re-do the image.
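To make the "contextual glyphs vs individual letters" point concrete, here is a minimal standard-library sketch. It does not call arabic_reshaper; it only inspects the Unicode presentation forms of a single letter (beh, U+0628), which is an illustrative choice and not something taken from this PR. As far as I understand, a reshaper substitutes position-specific forms like these for the abstract letters stored in the text.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter as stored in text

# Position-specific code points for the same letter (illustrative choice):
# isolated, final, initial and medial forms.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def base_letter(ch):
    """Map a presentation form back to its base letter via NFKC normalization."""
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    print(describe(BEH))
    for form in PRESENTATION_FORMS:
        print(describe(form), "->", describe(base_letter(form)))
```

Each of the four forms normalizes back to the same base letter, which is why text that has not been reshaped can end up drawn as disconnected individual letters.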
that would be really great, thank you. The image is pretty low resolution right now. Maybe add a link to the reshaper website in a comment to the code? |
Yup, will do. I believe the low-resolution appearance is a feature of the font I'm using. Unifont is a bitmap font -- it covers essentially every glyph in Unicode's Basic Multilingual Plane (which is A LOT of glyphs!). It's pretty basic-looking. I could have used an Arabic-specific font, which looks a lot nicer, but the nice thing about Unifont is that text in almost any language should be okay in wordcloud (well, things like directionality and context - like Arabic - have to be handled separately but that's what this Arabic example is for!)
How about using GNU FreeFont? http://ftp.gnu.org/gnu/freefont/ https://en.wikipedia.org/wiki/GNU_FreeFont
I'll take a look at the font tomorrow - would be great to have a more beautiful but very thorough Unicode font available. I am new to working with Arabic so I'm still learning a lot about encoding! I'll also get that Wikipedia article in Arabic to update the examples.
Hi, I just solved this problem by following these steps:
1 - In the library's site-packages folder, I replaced the "DroidSansMono.ttf" font file with a new one, keeping the same file name, for example the popular "Arial" font.
2 - I used each of these libraries to solve the Arabic character direction issue.
Then I applied this code:
I used the Arial font like you said @bakrianoo! It looks great. I'm just not sure what the copyright is around it, so I've kept Unifont in the examples/font folder since we're allowed to distribute that one for sure. Also updated the text sample so it's the Wikipedia entry as requested.
I'm pretty sure Arial is proprietary (copyrighted and possibly covered by other intellectual property protection; it ships with Microsoft products), so using Unifont is a better idea. Thanks @caleighm, I'll check it out in a bit.
Have you tried this font? https://en.wikipedia.org/wiki/GNU_FreeFont |
@amueller By the way, does GitHub send notifications so I can follow your replies here? Or how can I follow my replies here without digging through my browser history every time?
@amueller @bakrianoo I've similarly had no success with GNU Freefont - seems to be missing some characters. What I've done is provided a word cloud example that uses Arial (so it's more beautiful), but in my sample code I left the font selection to be Unifont (which is provided in the fonts folder). However, I put in some comments saying that the provided word cloud image was made with Arial, it's just that we are unable to distribute that font in this repo. My understanding is that we can use Arial for personal use but obviously can't distribute it to others. So creating a word cloud image and sharing that image with others, with the image using Arial, should be okay, but actively sharing Arial font is not ok. |
@bakrianoo I'm not sure I understand your question but on the right hand side you can "subscribe" to the thread for email notifications. |
Example using Arabic
====================
Generating a wordcloud from Arabic text
Other dependencies: bidi.algorithm, arabic_reshaper
Maybe say how to install those. You need to pip install python-bidi, which is not entirely obvious.
These are the official pages for both libraries:
http://python-bidi.readthedocs.io/en/latest/ seems more informative
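Since the import names differ from the names passed to pip, a small pre-flight check can make the install step less surprising. This is only a sketch: the helper name `missing_packages` is hypothetical, and the package names are the ones mentioned in this thread (`python-bidi` for `bidi`, `arabic-reshaper` for `arabic_reshaper`).

```python
import importlib.util

# Import name -> name to pass to pip (the two differ for both packages).
ARABIC_DEPENDENCIES = {
    "arabic_reshaper": "arabic-reshaper",
    "bidi": "python-bidi",
}


def missing_packages(dependencies=ARABIC_DEPENDENCIES):
    """Return the pip names of the dependencies that cannot be imported."""
    return [
        pip_name
        for module_name, pip_name in dependencies.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All Arabic example dependencies are importable.")
```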
I tried this code with a large amount of Arabic data but I get this error:
'%s not allowed here' % _ch['type']
AssertionError: RLI not allowed here
Can you resolve it, please?
@Abdulrahman44
Try removing RLI from the string
code.docx
@Lotemn102
How? Can you explain more, please?
By replacing every RLI in your text with an empty string.
Have you seen the code.docx file I added to my previous comment? (I can't copy the RLI character into here.)
This problem actually has nothing to do with this repository, so if you have more questions about it, please contact me at lotemn102@gmail.com or try asking on Stack Overflow.
OK, thanks.
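For readers who cannot open the attachment, the "replace every RLI with an empty string" advice can be sketched with the standard library alone. This is not the contents of code.docx. It assumes that the RLI in the error message refers to the Unicode bidi class of U+2067 (RIGHT-TO-LEFT ISOLATE), and it also drops the related isolate controls U+2066, U+2068 and U+2069, which is my own addition. The function name `strip_isolates` is hypothetical.

```python
import unicodedata

# Bidi classes of the directional isolate controls (U+2066..U+2069).
ISOLATE_CLASSES = {"LRI", "RLI", "FSI", "PDI"}


def strip_isolates(text):
    """Return text with every directional isolate control character removed."""
    return "".join(
        ch for ch in text
        if unicodedata.bidirectional(ch) not in ISOLATE_CLASSES
    )


if __name__ == "__main__":
    sample = "\u2067\u0633\u0644\u0627\u0645\u2069"  # RLI + an Arabic word + PDI
    cleaned = strip_isolates(sample)
    print(len(sample), "->", len(cleaned))
```

Running the cleaned string through the reshaping and bidi steps afterwards should avoid the assertion, although removing isolates may change how mixed-direction passages are ordered.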
OK, so my suggestion in terms of fonts would be to merge several of the Noto fonts. Maybe Latin, Arabic and Chinese for now?
You can download the file here: https://www.dropbox.com/s/j6uv2wiuu23asrs/NotoSans-Regular-merged-arabic.otf?dl=0 |
If you look for arabic here: https://www.google.com/get/noto/ |
Naskh: |
Other possibilities: https://www.typotheque.com/fonts/arabic/latin |
Sorry I've been MIA, was at a conference - will look more into this tomorrow night:
d5ef10d to ab85b24
Is it possible to make |
@AMR-KELEG what do you mean by that? It does support Unicode characters by default, but you need to specify a font that includes them. We could provide a warning message if characters are not included, but I'm not aware of a way to ship a font for all of Unicode. The only one I'm aware of is Noto, which would be many GB and not feasible to include in the package.
@amueller Yes, I meant adding a font to the repository.
@AMR-KELEG PR to the readme welcome. I'd also welcome a PR that adds a warning message if the font doesn't support some characters. Ideally the warning would point towards the readme/the docs with instructions for installing additional fonts.
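One possible shape for such a warning, as a rough sketch only: none of this is part of wordcloud, and the function names are hypothetical. The core check works on a plain set of code points. Reading that set from a font file is shown with fontTools (`pip install fonttools`), which is an assumption on my part and not a dependency anyone in this thread proposed.

```python
import warnings


def unsupported_characters(text, supported_codepoints):
    """Return the sorted non-whitespace characters of text that the font lacks."""
    return sorted(
        {ch for ch in text
         if not ch.isspace() and ord(ch) not in supported_codepoints}
    )


def warn_if_unsupported(text, supported_codepoints):
    """Emit one UserWarning listing missing characters; return them as a list."""
    missing = unsupported_characters(text, supported_codepoints)
    if missing:
        sample = " ".join(f"U+{ord(ch):04X}" for ch in missing[:10])
        warnings.warn(
            f"The font lacks glyphs for {len(missing)} character(s), "
            f"e.g. {sample}; pass font_path= with a font covering this script.",
            UserWarning,
            stacklevel=2,
        )
    return missing


def codepoints_in_font(font_path):
    """Read the code points a font file maps (assumes fontTools is installed)."""
    from fontTools.ttLib import TTFont  # lazy import: optional dependency

    font = TTFont(font_path)
    try:
        return set(font.getBestCmap() or {})
    finally:
        font.close()
```

A caller would do something like `warn_if_unsupported(text, codepoints_in_font(font_file))` before building the cloud. Note that a character being present in the font's character map says nothing about whether the text was reshaped or reordered correctly.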
I saw that there have been some comments. I think having an example in Arabic would actually be great. I don't remember entirely what the status of this PR is. I'd be happy to merge it in the current state and then maybe we can iterate from there? The examples will probably not work because CircleCI doesn't have the Arabic reshaper installed. It might be nice to fix that and then merge?
Hi everyone, thanks for the good discussion!

from collections import Counter
from wordcloud import WordCloud  # pip install wordcloud
import matplotlib.pyplot as plt
# -- Arabic text dependencies
from arabic_reshaper import reshape  # pip install arabic-reshaper
from bidi.algorithm import get_display  # pip install python-bidi

# reshape joins the letters into contextual forms; get_display reorders them for right-to-left display
rtl = lambda w: get_display(reshape(f'{w}'))

COUNTS = Counter("السلام عليكم ورحمة الله و بركاته السلام كلمة جميلة".split())
counts = {rtl(k): v for k, v in COUNTS.most_common(10)}

font_file = './NotoNaskhArabic-Regular.ttf'  # download from: https://www.google.com/get/noto
wordcloud = WordCloud(font_path=font_file).generate_from_frequencies(counts)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

The result:
@iamaziz is that much different from what's in this PR? Sorry I lost track, this should really go into the examples. |
@amueller sure, I can give it a try. It's gonna add two more dependencies |
Please only add them to the CI installation files, not to the main requirements.txt |
Sure, is the CI installation under |
I believe that you should modify this line to install dependencies for testing : https://github.com/amueller/word_cloud/blob/master/.circleci/config.yml#L22 I am not sure how each of these files is used. |
Ah cool, thanks! Yea, I was not sure either. I was poking around in https://github.com/amueller/word_cloud/blob/master/.travis.yml#L49 |
Just added a tiny wrapper for this example here. It's prob hacky 😅 but it works. For now, it's like
from ar_wordcloud import ArabicWordCloud
awc = ArabicWordCloud()
awc.from_text(...)
The relevant part in the circle config is here: Which calls this script: But basically you just need to add the dependencies here: |
Oh, that's nice. Will the example be added automatically to the site after building it? On the other hand, I believe that the current commits need some tiny modifications/bug fixes.
@AMR-KELEG if you add an example to the And you can just create a new PR indeed. You can put the original commits into your new PR to give credit. |
Here's a little Arabic example - I noticed a couple issues that folks had with Arabic, and in one of them you responded to say it would be helpful to have an example in the documentation. So, here's one!