Weighted names and surnames #168

Open
atnbueno opened this issue Jul 22, 2020 · 3 comments

Comments

@atnbueno
Contributor

I think this is pretty universal: some names and surnames are more common than others. It would be a bit more realistic if we could assign weights to names and surnames and imitate their real distribution.

@keitharm
Member

Sounds like a good feature, but it may be a bit complicated to implement: we would need a source that covers the surnames and their associated weights for all the nationalities.

@atnbueno
Contributor Author

atnbueno commented May 17, 2022

What I had in mind was a backwards-compatible solution, in one of two forms:

Solution #1: allow numbers after the names/surnames and, if present, use them as weights (no number meaning weight=1)

Solution #2: allow repeated names in the lists (equivalent to integer weights) and just pick one random line

If there's no weight information, the data stays as is. If someone finds a weighted source for a particular version, they can do a PR.
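
A minimal sketch of Solution #1, assuming each list entry is a plain text line with an optional trailing integer. It is written in Python for illustration only, and the names `parse_weighted_line`, `load_weighted`, and `pick_name` are hypothetical, not part of the project's API:

```python
import random


def parse_weighted_line(line):
    """Split 'Maria Carmen 38' into ('Maria Carmen', 38).

    A line without a trailing integer keeps weight 1, so existing
    unweighted lists keep working unchanged.
    """
    text = line.strip()
    head, _, tail = text.rpartition(" ")
    if head and tail.isdecimal():
        return head.strip(), int(tail)
    return text, 1


def load_weighted(lines):
    """Return parallel lists of names and weights, skipping blank lines."""
    names, weights = [], []
    for line in lines:
        if not line.strip():
            continue
        name, weight = parse_weighted_line(line)
        names.append(name)
        weights.append(weight)
    return names, weights


def pick_name(names, weights, rng=random):
    """Pick one name with probability proportional to its weight."""
    return rng.choices(names, weights=weights, k=1)[0]
```

Splitting on the last space keeps compound names such as "Jose Antonio" intact whether or not a weight follows them.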

@atnbueno
Contributor Author

atnbueno commented Dec 1, 2022

I've found official stats about names and surnames in Spain, and I've assembled weighted lists for male and female names, as well as surnames:

Antonio           38          Maria Carmen      38          Garcia     57
Manuel            34          Maria             34          Rodriguez  36
Jose              33          Carmen            21          Gonzalez   36
Francisco         28          Ana Maria         16          Fernandez  36
David             22          Maria Pilar       15          Lopez      34
Juan              20          Laura             15          Martinez   33
Javier            19          Josefa            15          Sanchez    32
Jose Antonio      18          Isabel            15          Perez      31
Daniel            18          Maria Dolores     15          Gomez      19
Francisco Javier  17          Maria Teresa      14          Martin     19
...                           ...                           ...
Emilio Jose        1          Elizabeth          1          Asensio     1
Jose Andres        1          Meritxell          1          Reina       1
Simon              1          Desiree            1          Polo        1
Luis Antonio       1          Gregoria           1          Ojeda       1
                1000 TOTAL    Antonia Maria      1          Ramon       1
                              ...                           ...
                              Maria Manuela      1          Carrera     1
                              Mia                1          Toledo      1
                              Maria Candelaria   1          Ayala       1
                              Maria Gracia       1          Alcaraz     1
                                              1000 TOTAL    Hernando    1
                                                            ...
                                                            Mejias      1
                                                            Carvajal    1
                                                            Rosales     1
                                                            Toro        1
                                                                     1000 TOTAL

After looking at the API code, it looks like simply using a list with 38 "Antonio" lines, 34 "Manuel", etc. would work without any code change.
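
As a rough illustration of that approach (Solution #2), the sketch below expands (name, weight) pairs into repeated lines. It assumes the generator picks one line uniformly at random, as described above; the helper `expand` is hypothetical, and Python is used only for illustration:

```python
import random
from collections import Counter


def expand(weighted):
    """Repeat each name `weight` times, one entry per output line."""
    lines = []
    for name, weight in weighted:
        lines.extend([name] * weight)
    return lines


# First three male names from the table above.
top_male = [("Antonio", 38), ("Manuel", 34), ("Jose", 33)]

lines = expand(top_male)      # 38 + 34 + 33 = 105 lines
counts = Counter(lines)       # how many lines each name occupies

# A uniform pick over the expanded list selects each name with
# probability weight / total, e.g. 38/105 for "Antonio".
sample = random.choice(lines)
```

The trade-off is file size: each full list grows to 1000 lines, but no parsing or selection logic has to change.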

Should I do a PR with such lists?
