This is a rough, unofficial implementation of the model in the paper "Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning" (ACM MM 2020).

TownWilliam/mma_sr

mma_sr

This is a rough, unofficial implementation of the baseline model, mma_sr, from the paper "Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning" (ACM MM 2020): https://dl.acm.org/doi/10.1145/3394171.3413753


The code is built on the mmf framework, a popular framework from Facebook AI Research. The other interfaces and packages are available here: https://github.com/facebookresearch/mmf

The paper introduces a model based on an LSTM and an attention module, and improves its performance by using the spatial relationship. However, I did not implement the spatial relationship component, because finding the neighbour ocr' = N(ocr) of each OCR token involves many conditions to handle.
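
To give a sense of why the neighbour search needs care, here is a minimal sketch of one possible definition of N(ocr): the closest other token by distance between bounding-box centers. This is an assumption made only for illustration, not the definition used in the paper, and the names `box_center` and `nearest_neighbours` are hypothetical. Even this simple version has to decide what to do when a token has no neighbour and how to break ties.

```python
import math


def box_center(box):
    """Return the (x, y) center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def nearest_neighbours(boxes):
    """For each box, return the index of the closest other box.

    Distance is measured between box centers (an illustrative choice).
    A box with no other box to compare against gets None.
    """
    centers = [box_center(b) for b in boxes]
    result = []
    for i, (cx, cy) in enumerate(centers):
        best, best_dist = None, math.inf
        for j, (ox, oy) in enumerate(centers):
            if i == j:
                continue
            dist = math.hypot(cx - ox, cy - oy)
            # Strict comparison: on a tie, the lower index is kept.
            if dist < best_dist:
                best, best_dist = j, dist
        result.append(best)
    return result


if __name__ == "__main__":
    # Three OCR boxes on one text line; the third is far to the right.
    ocr_boxes = [(0, 0, 2, 2), (3, 0, 5, 2), (20, 0, 22, 2)]
    print(nearest_neighbours(ocr_boxes))
```

A real implementation would probably also need to consider reading direction, overlapping boxes, and tokens on different text lines, which is where most of the conditions come from.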

All of the research belongs to the original researchers; I only rewrote it in PyTorch on my own for follow-up study. The author of the paper has since published a new paper at CVPR 2021: https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_Improving_OCR-Based_Image_Captioning_by_Incorporating_Geometrical_Relationship_CVPR_2021_paper.pdf. The structure of that model is similar to mma_sr, and it performs better on the OCR-based image captioning task. However, its code is still not open source.

To use this program, install the mmf framework mentioned above and then put mma_sr.py in the mmf folder at mmf/mmf/models. The model code is adapted from m4c.py in mmf. I used this framework because dataset processing for the OCR-based image captioning task is rather exhausting, so I wrote the program directly inside the mmf framework.
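
The copy step above can be sketched as a small script. This is only a convenience sketch under assumptions: `install_model_file` is a hypothetical helper, and `mmf_root` is assumed to be the top-level directory of a local mmf checkout that contains mmf/models. Copying the file by hand works just as well.

```python
import shutil
from pathlib import Path


def install_model_file(model_file, mmf_root):
    """Copy the model file into <mmf_root>/mmf/models and return the new path."""
    src = Path(model_file)
    models_dir = Path(mmf_root) / "mmf" / "models"
    if not src.is_file():
        raise FileNotFoundError(f"model file not found: {src}")
    if not models_dir.is_dir():
        raise NotADirectoryError(f"mmf models folder not found: {models_dir}")
    # copy2 also preserves file metadata such as modification time.
    return Path(shutil.copy2(src, models_dir / src.name))


# Example (paths are placeholders for your own setup):
# install_model_file("mma_sr.py", "/path/to/mmf")
```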

I am not a professional programmer, so I simply wrote the code following the paper's description. However, there is a gap between my results and the metrics reported in the paper.

| Model | BLEU-4 | METEOR | ROUGE-L | SPICE | CIDEr |
| --- | --- | --- | --- | --- | --- |
| baseline in paper | 24.0 | 22.6 | 47.0 | 15.6 | 93.7 |
| baseline (my code) | 22.3 | 21.5 | 45.0 | 14.5 | 86.9 |
| baseline (my code, beam=5) | 24.0 | 21.6 | 46.4 | 14.6 | 90.9 |
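
To make the size of the gap explicit, the following short sketch subtracts each reproduced score from the paper's score, using only the numbers reported above. The helper name `gaps` is hypothetical.

```python
METRICS = ["BLEU-4", "METEOR", "ROUGE-L", "SPICE", "CIDEr"]
PAPER = [24.0, 22.6, 47.0, 15.6, 93.7]
MY_CODE = [22.3, 21.5, 45.0, 14.5, 86.9]
MY_CODE_BEAM5 = [24.0, 21.6, 46.4, 14.6, 90.9]


def gaps(reference, reproduced):
    """Return reference minus reproduced for each metric, rounded to 0.1."""
    return {
        name: round(ref - rep, 1)
        for name, ref, rep in zip(METRICS, reference, reproduced)
    }


if __name__ == "__main__":
    print("my code:        ", gaps(PAPER, MY_CODE))
    print("my code, beam=5:", gaps(PAPER, MY_CODE_BEAM5))
```

With beam=5 the BLEU-4 score matches the paper, and the largest remaining difference is in CIDEr (2.8 points).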

Perhaps something is wrong with my code, but I could not find which part is incorrect. If you can find the problem in the code, I would appreciate it very much.

I will also modify the code for the other components later.
