Commit 71e283f

author
mir hossein
authored
cleanup codes
1 parent ceaa3e1 commit 71e283f

1 file changed

+22
-322
lines changed

README.md

Lines changed: 22 additions & 322 deletions
@@ -1,30 +1,40 @@
11

2-
# Pelerea
2+
# :white_check_mark: 🇮🇷 🇮🇷 🇮🇷 Regex for Persian (Farsi) Language 🇮🇷 🇮🇷 🇮🇷
33

4-
## :white_check_mark: Regular expressions for Persian aka Farsi language 🇮🇷
54

6-
#### Regular expressions for validating, sanitizing and filtering strings that must be Persian.
5+
#### Collection of Regex for validating, filtering, sanitizing and finding Persian strings.
6+
7+
8+
### Introduction
9+
10+
11+
For historical reasons, many Arabic characters found their way into the Persian language and transformed it. In recent years, governmental and non-governmental organizations have made many efforts to restore the standing of the Persian language, and this project is one of them.
12+
713

814
#### :eight_pointed_black_star: Notes
915

10-
* For historical reasons, many Arabic characters found their way into the Persian language and transformed it,
11-
but in recent years governmental and non-governmental organizations have made many efforts to give the Persian language its own identity.
1216

13-
* The Persian alphabet consists of 32 characters, but for the reasons above five more Arabic characters are used in many old texts, so the regex supports them.
17+
* The Persian alphabet consists of 32 characters and 3 vowel marks, but for the reasons above, 6 more Arabic characters and 8 more vowel marks are used in many texts.
18+
19+
20+
* The most important part of this effort is the codepoint ranges: you can build your own regex for validating, filtering and finding strings by putting the desired ranges in it.<br>
21+
For example, when a string should contain only Persian words and spaces, concatenate the space codepoints and the Persian alphabet codepoints in the final regex, and so on.
1422

15-
* The most important part of this effort is the codepoint ranges: you can build your own regex any way you want by putting the desired codepoint ranges in it.
1623

17-
* All patterns match a single word only, that is, characters with no spaces. To run a pattern against more than one word,
18-
concatenate the space pattern with the desired patterns.
24+
* Characters in the tables are sorted by codepoint.
1925

26+
* See tests after reading.
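
The note above about combining ranges can be sketched in Python as follows. This is a minimal sketch rather than code from the repository: the helper name `build_validator` is hypothetical, and the two range strings are copied from the Space and Persian alphabet sections of the earlier README revision.

```python
import re

# Range strings as they would appear inside a character class.
# Raw strings: the re module interprets the \uXXXX escapes itself.
SPACE = r'\u0020\u2000-\u200F\u2028-\u202F'
PERSIAN_ALPHA = (
    r'\u0621-\u0623\u0627-\u063A\u0641-\u0642\u0644-\u0648'
    r'\u0686\u0698\u06A9-\u06AF\u06BE\u06CC'
)


def build_validator(*ranges):
    """Hypothetical helper: compile a regex that accepts only the given ranges."""
    return re.compile('[' + ''.join(ranges) + ']+')


words_and_spaces = build_validator(SPACE, PERSIAN_ALPHA)

print(bool(words_and_spaces.fullmatch('این یک تست است')))  # Persian words and spaces
print(bool(words_and_spaces.fullmatch('this is a test')))   # Latin letters are rejected
```

Using `fullmatch` instead of `^...$` avoids the case where `$` also matches just before a trailing newline; either form works for single-line input.
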
27+
28+
---
2029

21-
### :black_square_button: Codepoints range
30+
31+
### :black_square_button: Codepoints Range
2232

2333

2434
### :white_square_button: Space
2535

2636

27-
Includes all kinds of spaces, especially the zero-width spaces that are used heavily in Persian.
37+
These ranges include all kinds of spaces, especially the zero-width spaces that are used heavily in Persian texts.
2838

2939

3040
```
@@ -34,9 +44,6 @@ U+2028-U+202F
3444
```
3545

3646

37-
#### :small_orange_diamond: Allowed characters
38-
39-
4047
code point | character | hex | name
4148
-----------|-----------|--------|---------------------
4249
U+0020 | |20 |SPACE
@@ -57,311 +64,4 @@ U+200D ‍ | |e2 80 8d| ZERO WIDTH JOINER
5764
U+200E ‎ | |e2 80 8e| LEFT-TO-RIGHT MARK
5865
U+200F ‏ | |e2 80 8f| RIGHT-TO-LEFT MARK
5966
U+2028 | |e2 80 a8| LINE SEPARATOR
60-
U+2029 | |e2 80 a9| PARAGRAPH SEPARATOR
61-
U+202A ‪ | |e2 80 aa| LEFT-TO-RIGHT EMBEDDING
62-
U+202B ‫ | |e2 80 ab| RIGHT-TO-LEFT EMBEDDING
63-
U+202C ‬ | |e2 80 ac| POP DIRECTIONAL FORMATTING
64-
U+202D ‭ | |e2 80 ad| LEFT-TO-RIGHT OVERRIDE
65-
U+202E ‮ | |e2 80 ae| RIGHT-TO-LEFT OVERRIDE
66-
U+202F   | |e2 80 af| NARROW NO-BREAK SPACE
67-
68-
69-
#### :small_orange_diamond: How to use it
70-
71-
72-
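```python
import re

text = ' '
# Raw string: the re module interprets the \uXXXX escapes itself.
space_codepoints = r'\u0020\u2000-\u200F\u2028-\u202F'
# Append persian_alpha_codepoints (defined in the next section) to also allow letters.
result = re.search('^[' + space_codepoints + ']+$', text)
if result:
    ...
```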
79-
80-
81-
```php
$input = ' ';
$space_codepoints = '\x{0020}\x{2000}-\x{200F}\x{2028}-\x{202F}';
// Append $persian_alpha_codepoints (next section) inside the brackets to also allow letters.
$result = preg_match('/^[' . $space_codepoints . ']+$/u', $input);
if ($result) {
    // ...
}
```
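
Because most of these characters are invisible, it can help to list which ones a string actually contains. The following is a minimal sketch under that assumption; the helper name `describe_spaces` is hypothetical, and the character class reuses the ranges listed in this section.

```python
import re
import unicodedata

# Same ranges as the Space section above.
SPACE_CLASS = re.compile(r'[\u0020\u2000-\u200F\u2028-\u202F]')


def describe_spaces(text):
    """Return (codepoint, Unicode name) for each space-like character found."""
    return [
        ('U+%04X' % ord(ch), unicodedata.name(ch, 'UNKNOWN'))
        for ch in SPACE_CLASS.findall(text)
    ]


# '\u200c' is the zero-width non-joiner, written as an escape so it stays visible.
sample = '\u0645\u06cc\u200c\u0631\u0648\u0645 \u062e\u0627\u0646\u0647'
for codepoint, name in describe_spaces(sample):
    print(codepoint, name)
```

For the sample string this reports the zero-width non-joiner first and the ordinary space second, in the order they occur.
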
88-
89-
---
90-
### :white_square_button: Persian alphabet (37 characters)
91-
92-
93-
```
94-
U+0621-U+0623
95-
U+0627-U+063A
96-
U+0641-U+0642
97-
U+0644-U+0648
98-
U+0686
99-
U+0698
100-
U+06A9-U+06AF
101-
U+06BE
102-
U+06CC
103-
```
104-
105-
106-
#### :small_orange_diamond: Allowed characters
107-
108-
109-
code point | character | hex | name
110-
-----------|-----------|-------|---------------------
111-
U+0621 | ء | d8 a1 | ARABIC LETTER HAMZA
112-
U+0622 | آ | d8 a2 | ARABIC LETTER ALEF WITH MADDA ABOVE
113-
U+0623 | أ | d8 a3 | ARABIC LETTER ALEF WITH HAMZA ABOVE
114-
U+0627 | ا | d8 a7 | ARABIC LETTER ALEF
115-
U+0628 | ب | d8 a8 | ARABIC LETTER BEH
116-
U+0629 | ة | d8 a9 | ARABIC LETTER TEH MARBUTA
117-
U+062A | ت | d8 aa | ARABIC LETTER TEH
118-
U+062B | ث | d8 ab | ARABIC LETTER THEH
119-
U+062C | ج | d8 ac | ARABIC LETTER JEEM
120-
U+062D | ح | d8 ad | ARABIC LETTER HAH
121-
U+062E | خ | d8 ae | ARABIC LETTER KHAH
122-
U+062F | د | d8 af | ARABIC LETTER DAL
123-
U+0630 | ذ | d8 b0 | ARABIC LETTER THAL
124-
U+0631 | ر | d8 b1 | ARABIC LETTER REH
125-
U+0632 | ز | d8 b2 | ARABIC LETTER ZAIN
126-
U+0633 | س | d8 b3 | ARABIC LETTER SEEN
127-
U+0634 | ش | d8 b4 | ARABIC LETTER SHEEN
128-
U+0635 | ص | d8 b5 | ARABIC LETTER SAD
129-
U+0636 | ض | d8 b6 | ARABIC LETTER DAD
130-
U+0637 | ط | d8 b7 | ARABIC LETTER TAH
131-
U+0638 | ظ | d8 b8 | ARABIC LETTER ZAH
132-
U+0639 | ع | d8 b9 | ARABIC LETTER AIN
133-
U+063A | غ | d8 ba | ARABIC LETTER GHAIN
134-
U+0641 | ف | d9 81 | ARABIC LETTER FEH
135-
U+0642 | ق | d9 82 | ARABIC LETTER QAF
136-
U+0644 | ل | d9 84 | ARABIC LETTER LAM
137-
U+0645 | م | d9 85 | ARABIC LETTER MEEM
138-
U+0646 | ن | d9 86 | ARABIC LETTER NOON
139-
U+0647 | ه | d9 87 | ARABIC LETTER HEH
140-
U+0648 | و | d9 88 | ARABIC LETTER WAW
141-
U+0686 | چ | da 86 | ARABIC LETTER TCHEH
142-
U+0698 | ژ | da 98 | ARABIC LETTER JEH
143-
U+06A9 | ک | da a9 | ARABIC LETTER KEHEH
144-
U+06AF | گ | da af | ARABIC LETTER GAF
145-
U+06BE | ھ | da be | ARABIC LETTER HEH DOACHASHMEE
146-
U+06CC | ی | db 8c | ARABIC LETTER FARSI YEH
147-
148-
149-
#### :small_orange_diamond: How to use it
150-
151-
152-
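```python
import re

text = 'این یک تست است'
space_codepoints = r'\u0020\u2000-\u200F\u2028-\u202F'  # from the Space section
# Adjacent raw strings are concatenated; no line-continuation backslashes needed.
persian_alpha_codepoints = (
    r'\u0621-\u0623\u0627-\u063A'
    r'\u0641-\u0642\u0644-\u0648\u0686\u0698'
    r'\u06A9-\u06AF\u06BE\u06CC'
)
result = re.search('^[' + space_codepoints + persian_alpha_codepoints + ']+$', text)
if result:
    ...
```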
161-
162-
163-
```php
$input = 'این یک تست است';
// Concatenate the parts so that no newline ends up inside the character class.
$persian_alpha_codepoints = '\x{0621}-\x{0623}\x{0627}-\x{063A}'
    . '\x{0641}-\x{0642}\x{0644}-\x{0648}'
    . '\x{0686}\x{0698}\x{06A9}-\x{06AF}\x{06BE}\x{06CC}';
// $space_codepoints is defined in the Space section.
$result = preg_match('/^[' . $space_codepoints . $persian_alpha_codepoints . ']+$/u', $input);
if ($result) {
    // ...
}
```
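
To check what a range list really covers, the `U+XXXX` notation can be expanded into individual codepoints. This is a minimal sketch: the function name `expand` is hypothetical, and the range string is copied from the list at the top of this section.

```python
import re

PERSIAN_ALPHA = (
    'U+0621-U+0623 U+0627-U+063A U+0641-U+0642 U+0644-U+0648 '
    'U+0686 U+0698 U+06A9-U+06AF U+06BE U+06CC'
)


def expand(ranges):
    """Expand 'U+XXXX' / 'U+XXXX-U+YYYY' items into a sorted list of codepoints."""
    codepoints = set()
    for item in ranges.split():
        bounds = [int(h, 16) for h in re.findall(r'U\+([0-9A-Fa-f]{4,6})', item)]
        codepoints.update(range(bounds[0], bounds[-1] + 1))
    return sorted(codepoints)


covered = expand(PERSIAN_ALPHA)
print(len(covered))        # 41
print(0x067E in covered)   # False
```

The listed ranges expand to 41 codepoints, while the heading says 37 and the table has 36 rows: `U+06A9-U+06AF` spans 7 codepoints of which the table shows only two, and U+067E (ARABIC LETTER PEH, پ) falls outside every range. The source does not say whether either difference is intentional.
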
172-
173-
---
174-
### :white_square_button: Persian numbers
175-
176-
177-
```
178-
U+06F0-U+06F9
179-
```
180-
181-
182-
code point | character | hex | name
183-
-----------|-----------|-------|---------------------
184-
U+06F0 | ۰ | db b0 | EXTENDED ARABIC-INDIC DIGIT ZERO
185-
U+06F1 | ۱ | db b1 | EXTENDED ARABIC-INDIC DIGIT ONE
186-
U+06F2 | ۲ | db b2 | EXTENDED ARABIC-INDIC DIGIT TWO
187-
U+06F3 | ۳ | db b3 | EXTENDED ARABIC-INDIC DIGIT THREE
188-
U+06F4 | ۴ | db b4 | EXTENDED ARABIC-INDIC DIGIT FOUR
189-
U+06F5 | ۵ | db b5 | EXTENDED ARABIC-INDIC DIGIT FIVE
190-
U+06F6 | ۶ | db b6 | EXTENDED ARABIC-INDIC DIGIT SIX
191-
U+06F7 | ۷ | db b7 | EXTENDED ARABIC-INDIC DIGIT SEVEN
192-
U+06F8 | ۸ | db b8 | EXTENDED ARABIC-INDIC DIGIT EIGHT
193-
U+06F9 | ۹ | db b9 | EXTENDED ARABIC-INDIC DIGIT NINE
194-
195-
196-
#### :small_orange_diamond: How to use it
197-
198-
199-
```python
import re

text = '۲۱۳'
persian_num_codepoints = r'\u06F0-\u06F9'
result = re.search('^[' + persian_num_codepoints + ']+$', text)
if result:
    ...
```
206-
207-
208-
```php
$input = '۲۱۳';
// No brackets here: the character class is opened and closed in the pattern below.
$persian_num_codepoints = '\x{06F0}-\x{06F9}';
$result = preg_match('/^[' . $persian_num_codepoints . ']+$/u', $input);
if ($result) {
    // ...
}
```
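
Once a string has been validated as Persian digits, it can be converted to a number directly. The following is a minimal sketch; the function name `parse_persian_number` is hypothetical.

```python
import re

PERSIAN_DIGITS = re.compile(r'[\u06F0-\u06F9]+')


def parse_persian_number(text):
    """Return the integer value of a string of Persian digits, or None."""
    if PERSIAN_DIGITS.fullmatch(text) is None:
        return None
    # int() accepts any Unicode decimal digit, including U+06F0-U+06F9.
    return int(text)


print(parse_persian_number('\u06f2\u06f1\u06f3'))  # 213
print(parse_persian_number('213'))                 # None: ASCII digits are outside the range
```

The regex check comes first because `int()` on its own would also accept ASCII and Arabic-Indic digits.
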
215-
216-
---
217-
### :white_square_button: Persian(Arabic) punctuation marks
218-
219-
220-
```
221-
U+060C
222-
U+061B
223-
U+061F
224-
U+0640
225-
U+066A
226-
U+066B
227-
U+066C
228-
```
229-
230-
231-
code point | character | hex | name
232-
-----------|-----------|-------|---------------------
233-
U+060C | ، | d8 8c | ARABIC COMMA
234-
U+061B | ؛ | d8 9b | ARABIC SEMICOLON
235-
U+061F | ؟ | d8 9f | ARABIC QUESTION MARK
236-
U+0640 | ـ | d9 80 | ARABIC TATWEEL
237-
U+066A | ٪ | d9 aa | ARABIC PERCENT SIGN
238-
U+066B | ٫ | d9 ab | ARABIC DECIMAL SEPARATOR
239-
U+066C | ٬ | d9 ac | ARABIC THOUSANDS SEPARATOR
240-
241-
242-
243-
#### :eight_pointed_black_star: For more common punctuation marks like `” | « | » | ?| ; | : | ...` <br> see the [General Punctuation page in Unicode](https://en.wikipedia.org/wiki/List_of_Unicode_characters#General_Punctuation)
244-
245-
246-
247-
```python
import re

# space_codepoints and persian_alpha_codepoints are defined in the sections above.
text = 'این یک نوشته تست است؟'
punctuation_marks_codepoints = r'\u060C\u061B\u061F\u0640\u066A\u066B\u066C'
result = re.search('^[' + space_codepoints + punctuation_marks_codepoints + persian_alpha_codepoints + ']+$', text)
if result:
    ...
```
254-
255-
256-
```php
// $space_codepoints, $persian_alpha_codepoints and $input are defined in the sections above.
$punctuation_marks_codepoints = '\x{060C}\x{061B}\x{061F}\x{0640}\x{066A}\x{066B}\x{066C}';
$result = preg_match('/^[' . $space_codepoints . $punctuation_marks_codepoints . $persian_alpha_codepoints . ']+$/u', $input);
if ($result) {
    // ...
}
```
262-
263-
---
264-
### :white_square_button: Additional Arabic characters and vowels marks
265-
266-
267-
:eight_pointed_black_star: These characters are used in old Persian texts.
268-
269-
270-
```
271-
U+0624-U+0626
272-
U+0643
273-
U+0649-U+0655
274-
U+06D5
275-
```
276-
277-
278-
code point | character | hex | name
279-
-----------|-----------|-------|---------------------
280-
U+0624 | ؤ | d8 a4 | ARABIC LETTER WAW WITH HAMZA ABOVE
281-
U+0625 | إ | d8 a5 | ARABIC LETTER ALEF WITH HAMZA BELOW
282-
U+0626 | ئ | d8 a6 | ARABIC LETTER YEH WITH HAMZA ABOVE
283-
U+0643 | ك | d9 83 | ARABIC LETTER KAF
284-
U+0649 | ى | d9 89 | ARABIC LETTER ALEF MAKSURA
285-
U+064A | ي | d9 8a | ARABIC LETTER YEH
286-
U+064B | ً | d9 8b | ARABIC FATHATAN
287-
U+064C | ٌ | d9 8c | ARABIC DAMMATAN
288-
U+064D | ٍ | d9 8d | ARABIC KASRATAN
289-
U+064E | َ | d9 8e | ARABIC FATHA
290-
U+064F | ُ | d9 8f | ARABIC DAMMA
291-
U+0650 | ِ | d9 90 | ARABIC KASRA
292-
U+0651 | ّ | d9 91 | ARABIC SHADDA
293-
U+0652 | ْ | d9 92 | ARABIC SUKUN
294-
U+0653 | ٓ | d9 93 | ARABIC MADDAH ABOVE
295-
U+0654 | ٔ | d9 94 | ARABIC HAMZA ABOVE
296-
U+0655 | ٕ | d9 95 |ARABIC HAMZA BELOW
297-
U+06D5 | ە | db 95 | ARABIC LETTER AE
298-
299-
300-
#### :small_orange_diamond: How to use it
301-
302-
303-
```python
import re

text = 'ؤ'
additional_arabic_characters_and_vowels_marks_codepoint = r'\u0624-\u0626\u0643\u0649-\u0655\u06D5'
result = re.search('^[' + additional_arabic_characters_and_vowels_marks_codepoint + ']+$', text)
if result:
    ...
```
310-
311-
312-
```php
$input = 'ؤ';
$additional_arabic_characters_and_vowels_marks_codepoint = '\x{0624}-\x{0626}\x{0643}\x{0649}-\x{0655}\x{06D5}';
$result = preg_match('/^[' . $additional_arabic_characters_and_vowels_marks_codepoint . ']+$/u', $input);
if ($result) {
    // ...
}
```
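
Instead of only accepting these Arabic letters, a project may prefer to rewrite some of them to their Persian counterparts before validating. The following is a minimal sketch of that idea. The mapping is an assumption based on a common convention, not something this README prescribes, and the name `normalize_letters` is hypothetical.

```python
# Assumed mapping (a common convention, not taken from the README):
# ARABIC LETTER KAF (U+0643) -> ARABIC LETTER KEHEH (U+06A9)
# ARABIC LETTER YEH (U+064A) -> ARABIC LETTER FARSI YEH (U+06CC)
ARABIC_TO_PERSIAN = str.maketrans({
    '\u0643': '\u06A9',
    '\u064A': '\u06CC',
})


def normalize_letters(text):
    """Replace Arabic KAF/YEH with their Persian counterparts."""
    return text.translate(ARABIC_TO_PERSIAN)


arabic_spelling = '\u0643\u062a\u0627\u0628\u064a'  # uses U+0643 and U+064A
persian_spelling = normalize_letters(arabic_spelling)
print([hex(ord(ch)) for ch in persian_spelling])
```

After this step the string contains only codepoints from the Persian alphabet ranges, so the stricter alphabet pattern can be used for validation.
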
319-
320-
---
321-
### :white_square_button: Arabic numbers
322-
323-
324-
```
325-
U+0660-U+0669
326-
```
327-
328-
329-
code point | character | hex | name
330-
-----------|-----------|-------|---------------------
331-
U+0660 | ٠ | d9 a0 | ARABIC-INDIC DIGIT ZERO
332-
U+0661 | ١ | d9 a1 | ARABIC-INDIC DIGIT ONE
333-
U+0662 | ٢ | d9 a2 | ARABIC-INDIC DIGIT TWO
334-
U+0663 | ٣ | d9 a3 | ARABIC-INDIC DIGIT THREE
335-
U+0664 | ٤ | d9 a4 | ARABIC-INDIC DIGIT FOUR
336-
U+0665 | ٥ | d9 a5 | ARABIC-INDIC DIGIT FIVE
337-
U+0666 | ٦ | d9 a6 | ARABIC-INDIC DIGIT SIX
338-
U+0667 | ٧ | d9 a7 | ARABIC-INDIC DIGIT SEVEN
339-
U+0668 | ٨ | d9 a8 | ARABIC-INDIC DIGIT EIGHT
340-
U+0669 | ٩ | d9 a9 | ARABIC-INDIC DIGIT NINE
341-
342-
343-
#### :large_orange_diamond: How to use it
344-
345-
346-
```python
import re

text = '١٥٤٦'
arabic_numbers_codepoints = r'\u0660-\u0669'
result = re.search('^[' + arabic_numbers_codepoints + ']+$', text)
if result:
    ...
```
353-
354-
355-
```php
$input = '١٥٤٦';
$arabic_numbers_codepoints = '\x{0660}-\x{0669}';
$result = preg_match('/^[' . $arabic_numbers_codepoints . ']+$/u', $input);
if ($result) {
    // ...
}
```
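
Since the Arabic-Indic digits (U+0660-U+0669) and the Persian digits (U+06F0-U+06F9) are parallel ranges, one can be translated to the other position by position. This is a minimal sketch; the names `TO_PERSIAN` and `to_persian_digits` are hypothetical.

```python
# Both strings list the digits zero through nine in order.
ARABIC_DIGIT_CHARS = '\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669'
PERSIAN_DIGIT_CHARS = '\u06F0\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9'

# Translation table mapping each Arabic-Indic digit to the Persian digit of the same value.
TO_PERSIAN = str.maketrans(ARABIC_DIGIT_CHARS, PERSIAN_DIGIT_CHARS)


def to_persian_digits(text):
    """Replace Arabic-Indic digits with Persian digits; other characters are untouched."""
    return text.translate(TO_PERSIAN)


converted = to_persian_digits('\u0661\u0665\u0664\u0666')
print([hex(ord(ch)) for ch in converted])
```
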
362-
363-
364-
365-
:checkered_flag::checkered_flag::checkered_flag: waiting to see your collaboration
366-
367-
67+
U+2029
