data

Here is an example of the pre-training data. The two files, "raw_multidata_of_product_preatrain.small_train" and "raw_multidata_of_product_preatrain.small_valid", contain about 500 samples. Each line has 5 fields: Item ID, Item title, Item image URL, Item KG, and Item category (which is not used when pre-training the K3M model).

Take one line of data as an example: "631432564317 家用户型圆桌圆形折叠桌简约餐桌老式简易桌子4人8人小吃饭桌子大 https://img.alicdn.com/imgextra/https://img.alicdn.com/imgextra/i1/2209527442205/O1CN014C9neM1S9vxl3cyha_!!0-item_pic.jpg 省份#:#河北省#;#款式定位#:#经济型#;#附加功能#:#多功能#;#人造板种类#:#密度板/纤维板#;#地市#:#廊坊市#;#区县#:#安次区#;#材质#:#人造板#;#是否可定制#:#否#;#出租车是否可运输#:#是#;#风格#:#简约现代 折叠桌"

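where the five fields are, in order: the Item ID ("631432564317"), the Item title (the product title text), the Item image URL, the Item KG (the property-value pairs, from "省份#:#河北省" through "风格#:#简约现代"), and the Item category ("折叠桌", i.e. folding table).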

Note:

  1. If you want to pre-train the model on your own data, construct the data in the format above. The five fields in each row are separated by '\t' (a tab character).

  2. The "Item KG" field consists of the item's "property-value" pairs. Different pairs are separated by '#;#', and within a pair the property and its value are separated by '#:#'. In the example, the property-value pair "省份#:#河北省" (province: Hebei Province) represents the triple <item, 省份, 河北省>.

  3. Since the "Item category" field is not used in the pre-training stage, it can be set to any value in your own data.
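
The format described in the notes above can be read with a few lines of Python. The following is only a minimal sketch, not the loader used by the K3M code: the names `Item`, `parse_kg`, `parse_line` and `to_triples` are hypothetical, and it assumes that each line has exactly five tab-separated fields and that malformed property-value pairs can simply be skipped.

```python
from typing import List, NamedTuple, Tuple

FIELD_SEP = "\t"   # separator between the five fields of a row
PAIR_SEP = "#;#"   # separator between property-value pairs in "Item KG"
KV_SEP = "#:#"     # separator between a property and its value


class Item(NamedTuple):
    """One parsed row (hypothetical container, not part of the K3M code)."""
    item_id: str
    title: str
    image_url: str
    kg: List[Tuple[str, str]]   # list of (property, value) pairs
    category: str               # not used during pre-training


def parse_kg(kg_field: str) -> List[Tuple[str, str]]:
    """Split the "Item KG" field into (property, value) pairs."""
    pairs = []
    for chunk in kg_field.split(PAIR_SEP):
        prop, sep, value = chunk.partition(KV_SEP)
        if not sep or not prop:
            # Assumption: silently skip empty or malformed pairs.
            continue
        pairs.append((prop, value))
    return pairs


def parse_line(line: str) -> Item:
    """Parse one tab-separated row into an Item."""
    fields = line.rstrip("\r\n").split(FIELD_SEP)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    item_id, title, image_url, kg_field, category = fields
    return Item(item_id, title, image_url, parse_kg(kg_field), category)


def to_triples(item: Item) -> List[Tuple[str, str, str]]:
    """Turn each pair into an <item, property, value> triple."""
    return [(item.item_id, prop, value) for prop, value in item.kg]


if __name__ == "__main__":
    # Shortened, made-up row in the documented format (illustration only).
    row = "\t".join([
        "631432564317",
        "example title",
        "https://example.com/item.jpg",
        "省份#:#河北省#;#风格#:#简约现代",
        "折叠桌",
    ])
    for triple in to_triples(parse_line(row)):
        print(triple)
```

To read a whole file, one could call `parse_line` on every line of "raw_multidata_of_product_preatrain.small_train" opened with `encoding="utf-8"`. The UTF-8 encoding is itself an assumption, since the source does not state it.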