CSV export revision (Adding new fields and removing some)

The CSV export has not evolved for a long time, and some people need new data in it. This issue is a place to discuss which new fields could be exported.

**Fields to add:**
- `ingredients_analysis_tags`
  - description: this field is computed from the ingredients analysis and indicates whether the product is vegan, vegetarian, and/or contains palm oil
  - Example: `ingredients_analysis_tags: ["en:palm-oil","en:non-vegan","en:vegetarian-status-unknown"]`
  - rationale: it can be a helper to control data quality

- `nutrient_levels_tags`?
  - description: "It represents the traffic lights system made by the UK FSA (Food Standards Agency). It is used by some manufacturers on a voluntary basis in Great Britain, but was rejected by the European Commission in 2010. On Open Food Facts, when the nutritional values are known, the traffic lights are displayed on the product pages. The calculation formula defining the colors of the lights is described on a [dedicated page of the Open Food Facts website](https://world.openfoodfacts.org/nutrition-traffic-lights). Nutrient levels can be found on the website: https://world.openfoodfacts.org/nutrient-levels"
  - Example: `nutrient_levels_tags: ["en:fat-in-high-quantity","en:saturated-fat-in-high-quantity","en:sugars-in-high-quantity","en:salt-in-low-quantity"]`
  - rationale: it can be a helper to control data quality

- `product_quantity`
  - description: "This is the normalized quantity of the product, in grams (or in milliliters for volumes, as in the first example below)."
  - Example: `product_quantity: "1500"`, computed from `quantity: "1,5 L"`; `product_quantity: "320"`, computed from `quantity: "2 x 160 g"`
  - rationale: this data is easier to work with because Open Food Facts already does the computation. We already provide `serving_quantity`.

- `owner`
  - description: "This is the owner of the product, which has sent the product's data to Open Food Facts. The list of owners can be found at https://world.openfoodfacts.org/owners"
  - Example: `owner: "org-carrefour"`
  - rationale: it can be a helper to control data quality

- `data_quality_errors_tags`
  - description: "Returns a list of all detected errors for the product."
  - Example: `data_quality_errors_tags: ["en:nutrition-saturated-fat-greater-than-fat"]` (speaks for itself).
  - rationale: 1. Open Food Facts reusers might want to exclude products with quality issues. 2. It can make it easier to build data quality tools. It represents ~30K products as of 2022-10.

- `unique_scans_n` (contains the number of unique scans of a product; ~33% of products)
  - description: "Returns an integer which represents the number of users who have scanned the product at least once. 'Users' are identified by distinct IP addresses. This value is not computed in real time but once a year."
  - Example: `unique_scans_n: "8"`
  - rationale: this is a good proxy for understanding which products are the most consumed.

- `popularity_tags`
  - description: "The popularity of a product is computed from its number of unique scans. The `popularity_tags` field groups products by level of popularity per year, either worldwide or in the countries where the product is popular."
  - Example: `popularity_tags: ["top-50000-scans-2019","top-100000-scans-2019","at-least-5-scans-2019","at-least-10-scans-2019","top-75-percent-scans-2019","top-80-percent-scans-2019","top-85-percent-scans-2019","top-90-percent-scans-2019","top-50000-fr-scans-2019","top-100000-fr-scans-2019","top-country-fr-scans-2019","at-least-5-fr-scans-2019","at-least-10-fr-scans-2019"]`
  - rationale: this field might be clearer than `unique_scans_n`, as the latter could suggest that the number is fresh, if not real-time, data

- `completeness`
  - description: completeness is a floating-point number, between 0 and 1.1, measuring how complete the product is (the higher, the more complete). Currently, we check more than 10 items: image completeness, `product_name`, `quantity`, `packaging`, `brands`, `categories`, `origins`, `emb_codes`, `expiration_date`, `ingredients_text`, `nutriments` (or `no_nutrition_data` if it is `on`).
  - Example: `completeness: 0.7625`

- `last_image_t`
  - description: the date (as a Unix timestamp) at which the most recent image was uploaded for the product.
  - Example: `last_image_t: 1666661491`
  - rationale: it can be a helper to control data quality: if the product does not have any image, it is likely impossible to fix its data quality issues
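
To make the `product_quantity` examples above concrete, here is a minimal Python sketch of how a `quantity` string could be normalized. It is not the actual Product Opener implementation: the function name, the regular expression, and the unit table are all assumptions made for illustration, and only simple patterns such as `"1,5 L"` or `"2 x 160 g"` are handled.

```python
import re

# Hypothetical unit table: factor to convert to grams (mass) or milliliters (volume).
UNIT_FACTORS = {"g": 1, "kg": 1000, "ml": 1, "cl": 10, "dl": 100, "l": 1000}

# Optional "<count> x" multiplier, then a number (comma or dot decimal), then a unit.
_QUANTITY_RE = re.compile(
    r"^\s*(?:(?P<count>\d+)\s*[x×*]\s*)?"
    r"(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>[a-zA-Z]+)\s*$"
)


def normalize_quantity(quantity):
    """Return the quantity in g or ml as a float, or None if it cannot be parsed."""
    match = _QUANTITY_RE.match(quantity)
    if match is None:
        return None
    unit = match.group("unit").lower()
    if unit not in UNIT_FACTORS:
        return None
    count = int(match.group("count") or 1)
    value = float(match.group("value").replace(",", "."))
    return count * value * UNIT_FACTORS[unit]


print(normalize_quantity("1,5 L"))      # 1500.0
print(normalize_quantity("2 x 160 g"))  # 320.0
```

Reusers currently have to write something like this themselves, which is the point of exporting the precomputed value.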

**Fields to delete?**
- Should we keep both `created_t` / `last_modified_t` and `created_datetime` / `last_modified_datetime`? ([51 Mb lost](http://mirabelle.openfoodfacts.org/products?sql=select+sum%28%0D%0A++++length%28created_t%29%2B%0D%0A++++length%28last_modified_t%29%0D%0A++%29%2F1000000%0D%0A++as+Mb+from+%5Ball%5D))
- `additives`, which is empty
- one of the three `states*` fields; `states` and `states_tags` [are almost identical, the only difference being that `states` contains spaces](http://mirabelle.openfoodfacts.org/products?sql=select+states%2C+states_tags%2C+states_en+from+%5Ball%5D+where+replace%28states%2C+%27+%27%2C+%27%27%29+%3C%3E+states_tags+order+by+states+limit+200)
- `brands` or `brands_tags`: the latter is only the normalized version of the former (lowercased, unaccented, with spaces and typographic signs replaced by a "-")
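
To illustrate why `brands_tags` is largely derivable from `brands`, here is a rough Python sketch of the normalization described above (lowercase, remove accents, replace spaces and typographic signs with "-"). The function names are hypothetical, the comma-separated input format is an assumption, and the real server-side normalization may differ in edge cases.

```python
import re
import unicodedata


def normalize_tag(value):
    """Approximate tag normalization: lowercase, unaccent, dash-separate."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", value.lower())
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Limitation of this sketch: any run of characters outside a-z and 0-9
    # (including non-Latin letters) collapses into a single "-".
    dashed = re.sub(r"[^a-z0-9]+", "-", unaccented)
    return dashed.strip("-")


def brands_to_tags(brands):
    """Assumes `brands` is a comma-separated string; returns normalized tags."""
    tags = (normalize_tag(part) for part in brands.split(","))
    return [tag for tag in tags if tag]


print(brands_to_tags("Nestlé, Bonne Maman"))  # ['nestle', 'bonne-maman']
```

The reverse direction is lossy (accents and case cannot be recovered), so if only one of the two fields is kept, `brands` is the one that preserves more information.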

**Fields that could evolve:**
- URL fields take up a lot of space
  - all image URLs begin with https://xxxxx.openfoodfacts.org/images/products/
  - It is [866 Mb of data as of 2022-10-12](http://mirabelle.openfoodfacts.org/products?sql=select+%28%0D%0A++++%28length%28%22https%3A%2F%2Fworld-en.openfoodfacts.org%2Fproduct%2F%22%29*count%28url%29%29%0D%0A++++%2B%0D%0A++++%28length%28%22https%3A%2F%2Fimages.openfoodfacts.org%2Fimages%2Fproducts%2F%22%29*%0D%0A+++++++%28count%28image_url%29%0D%0A++++++++%2Bcount%28image_small_url%29%0D%0A++++++++%2Bcount%28image_ingredients_url%29%0D%0A++++++++%2Bcount%28image_ingredients_small_url%29%0D%0A++++++++%2Bcount%28image_nutrition_url%29%0D%0A++++++++%2Bcount%28image_nutrition_small_url%29%0D%0A+++++++%29%29%0D%0A++%29%2F1000000%0D%0A++as+Mb+from+%5Ball%5D), representing ~13% of the CSV size.
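
As a sketch of what evolving these fields could look like, the Python snippet below estimates the bytes spent repeating a shared prefix and shows that a URL can be rebuilt from its suffix alone. The prefix is the one used in the query linked above; the function names and the sample paths are hypothetical, and a single fixed image host is assumed.

```python
# Prefix taken from the size query above; a single fixed host is an assumption.
IMAGE_PREFIX = "https://images.openfoodfacts.org/images/products/"


def prefix_overhead_bytes(urls, prefix=IMAGE_PREFIX):
    """Bytes spent repeating `prefix` across the URLs that start with it."""
    prefix_size = len(prefix.encode("utf-8"))
    return sum(prefix_size for url in urls if url and url.startswith(prefix))


def strip_prefix(url, prefix=IMAGE_PREFIX):
    """Keep only the part after the shared prefix; leave other values untouched."""
    return url[len(prefix):] if url.startswith(prefix) else url


def restore_prefix(value, prefix=IMAGE_PREFIX):
    """Rebuild the full URL; empty values and absolute URLs are returned as-is."""
    if not value or "://" in value:
        return value
    return prefix + value


# Hypothetical sample values, including an empty cell.
sample = [IMAGE_PREFIX + "some/path/front.jpg", "", IMAGE_PREFIX + "other/path/front.jpg"]
print(prefix_overhead_bytes(sample))
print([strip_prefix(url) for url in sample])
```

Exporting only the suffix would shift a little work to reusers, who would need to know the prefix to rebuild each URL.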

**Process:**
- build a new CSV file as a beta version, with a different name from the current one, and let people discuss it
- when the beta version becomes stable/official, keep the old version and tell everyone that it will be unavailable in xx months

**Implementation**
* First modify: `@export_fields` in https://github.com/openfoodfacts/openfoodfacts-server/blob/main/lib/ProductOpener/Config_off.pm
* Then modify export script: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl
* Finally, update the documentation: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/html/data/data-fields.txt

[to be completed]


Issue #2325, opened by CharlesNepote on Sep 13, 2019

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

CSV export revision (Adding new fields and removing some) #2325

Description

CharlesNepoteopenedon Sep 13, 2019

Metadata