This project applies a logistic regression model to a real-world problem and explores the different possibilities and outcomes.
Unlike linear regression, which predicts a continuous value, logistic regression maps elements of the input space to discrete outputs. Here it estimates the probability (modeled through the log odds) that an array of grayscale pixels from a human face represents a specific gender, age, or ethnicity.
Decision trees and random forests can also be used to solve this problem, since these models work well with the dataset used here.
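To make the log-odds idea concrete, here is a minimal sketch of how a binary logistic model turns a pixel vector into a probability. The names `sigmoid`, `predict_proba`, and `predict_label` are illustrative and not taken from this project's code, and `theta[0]` is assumed to be the intercept.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability in (0, 1), without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta: list[float], x: list[float]) -> float:
    """Probability of the positive class; theta[0] is assumed to be the intercept."""
    log_odds = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(log_odds)


def predict_label(theta: list[float], x: list[float], threshold: float = 0.5) -> int:
    """Turn the probability into a discrete 0/1 output."""
    return int(predict_proba(theta, x) >= threshold)
```

For a multi-class label such as age or ethnicity, one such model per class (one-vs-rest) is a common approach, though the source does not say which scheme is used here.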
- Initialize the project: `scrapy startproject mySpider`
- Start scraping: `scrapy runspider mySpider.py`
- Spiders are responsible for issuing requests and parsing responses into items;
- Requests are sent to the remote server with a random user-agent;
- Pipelines receive parsed items and process them one by one, and this is where the data are stored in MongoDB;
- Items with `images_url` are captured by `ImagesPipeline` and sent to `ProxyUADownloaderMiddleware`;
- Rejected download requests are resent using different proxies from the IP pool;
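The random user-agent and proxy-retry steps above could look roughly like the sketch below. It is not the project's `ProxyUADownloaderMiddleware`: the class name `RotatingProxyUAMiddleware`, the `USER_AGENTS` and `PROXIES` pools, the retry statuses, and the `proxy_retries` meta key are all assumptions made for illustration. It relies only on Scrapy's standard downloader-middleware hooks `process_request` and `process_response`.

```python
import random

# Hypothetical pools; a real project would load these from settings or a proxy service.
USER_AGENTS = ["example-agent/1.0", "example-agent/2.0"]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


class RotatingProxyUAMiddleware:
    """Illustrative downloader middleware: random user-agent, proxy swap on rejection."""

    RETRY_STATUSES = {403, 429}  # assumed "rejected" status codes
    MAX_RETRIES = 3              # assumed cap so a request cannot loop forever

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Keep a proxy that a retry already chose; otherwise pick one at random.
        request.meta.setdefault("proxy", random.choice(PROXIES))
        return None  # let Scrapy continue processing this request

    def process_response(self, request, response, spider):
        retries = request.meta.get("proxy_retries", 0)
        if response.status in self.RETRY_STATUSES and retries < self.MAX_RETRIES:
            retry = request.copy()
            others = [p for p in PROXIES if p != request.meta.get("proxy")]
            retry.meta["proxy"] = random.choice(others or PROXIES)
            retry.meta["proxy_retries"] = retries + 1
            retry.dont_filter = True  # bypass the duplicate-request filter
            return retry  # returning a Request reschedules it
        return response
```

A middleware like this would be enabled through `DOWNLOADER_MIDDLEWARES` in the project's `settings.py`.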
The dataset was obtained by scraping the page https://generated.photos/faces/. The images are downloaded and stored in MongoDB; to reduce the load on my laptop, they are resized to 64x64 and converted to grayscale. Labels are extracted from the link to each image and stored along with that link. The outline of the database is shown below (the keys match those in the dataset).
key | type | note |
---|---|---|
pixels | List[] | |
gender | str | gender |
age | str | age |
ethnicity | str | ethnicity |
emotion | str | emotion |
hair_color | str | hair_color |
eye_color | str | eye_color |
hair_length | str | hair_length |
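As a rough illustration of how a record with this shape might be built, the sketch below converts an image to 64x64 grayscale with Pillow and merges in the labels. The helper names `to_gray_pixels` and `build_record` are hypothetical, and storing `pixels` as a flat row-major list of 0-255 integers is an assumption; the table only says `List[]`.

```python
from PIL import Image


def to_gray_pixels(img: Image.Image, size: tuple[int, int] = (64, 64)) -> list[int]:
    """Convert to 8-bit grayscale, resize, and flatten to a row-major list of ints."""
    gray = img.convert("L").resize(size)
    # In mode "L" every pixel is exactly one byte, so the raw bytes are the pixel values.
    return list(gray.tobytes())


def build_record(img: Image.Image, labels: dict[str, str]) -> dict:
    """Combine pixel data with string labels (gender, age, ...) into one document."""
    return {"pixels": to_gray_pixels(img), **labels}
```

A dictionary like this could then be passed to a MongoDB insert call from the item pipeline.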
Please contact me to acquire the dataset :)
- Out of 20000 test pairs, the model predicts 97.24% correctly.
- The dataset is not biased with respect to gender.
McFadden's rho-squared compares the log-likelihood of the fitted model with that of a null model. For the null parameter vector (theta null), we use an array of all zeros except for the leading entry.

McFadden's pseudo-R-squared: 0.31467801021392716
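For reference, the standard definition is rho-squared = 1 - LL(theta_hat) / LL(theta_null), where LL is the log-likelihood of the observed labels. The sketch below computes it for a binary logistic model. The function names are illustrative rather than the project's, and each row of `X` is assumed to start with a constant 1 so that the first entry of `theta` acts as the intercept.

```python
import math


def _softplus(z: float) -> float:
    """log(1 + exp(z)) computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_likelihood(theta, X, y) -> float:
    """Bernoulli log-likelihood of 0/1 labels y under a logistic model theta."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(t * v for t, v in zip(theta, xi))
        log_p1 = -_softplus(-z)  # log P(y = 1 | x)
        log_p0 = -_softplus(z)   # log P(y = 0 | x)
        total += yi * log_p1 + (1 - yi) * log_p0
    return total


def mcfadden_rho_squared(theta_hat, theta_null, X, y) -> float:
    """1 - LL(fitted) / LL(null); 0 means no improvement over the null model."""
    return 1.0 - log_likelihood(theta_hat, X, y) / log_likelihood(theta_null, X, y)
```

Because both log-likelihoods are negative, a fitted model that explains the labels better than the null model yields a value between 0 and 1.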
In the book - Behavioural Travel Modelling, edited by David Hensher and Peter Stopher, 1979 - McFadden contributed Ch. 15, "Quantitative Methods for Analyzing Travel Behaviour of Individuals: Some Recent Developments." Discussion of model evaluation (in the context of multinomial logit models) begins on page 306, where he introduces ρ² (McFadden's pseudo R²). McFadden states, "while the R² index is a more familiar concept to planners who are experienced in OLS, it is not as well behaved as the ρ² measure, for ML estimation. Those unfamiliar with ρ² should be forewarned that its values tend to be considerably lower than those of the R² index... For example, values of 0.2 to 0.4 for ρ² represent EXCELLENT fit."
- The closer the ROC curve is to the upper-left corner, the better the model performs, because we want a high true positive rate (TPR) together with a low false positive rate (FPR).
- The closer the AUC is to 1, the better the model's predictions.
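To show where the curve and the AUC come from, here is a small pure-Python sketch that sweeps the decision threshold from high to low and integrates with the trapezoidal rule. The function names are illustrative, and the input is assumed to contain at least one positive and one negative label.

```python
def roc_curve(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos  # both classes must be present, or this divides by zero
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )
```

A classifier that ranks every positive above every negative reaches the point (0, 1) and an AUC of 1, while random scoring gives an AUC near 0.5.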
Run `python visualizer.py` to discover!
- Train with RGB data since it has a better result in predicting eye_color, hair_color, and ethnicity.
- Try using `scrapy_splash` to render JS.
- Implement a decision tree for the same task.