Tutorial here. The input file is in .tsv format. Each line contains the uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used), and image base64 string, separated by ...
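
As a rough illustration of the layout described above, the following sketch reads one record per line and decodes the image field. It assumes the five fields are tab-separated (suggested by the .tsv extension, not confirmed here) and appear in the order listed. The key names in `FIELDS` and the helper `read_rows` are hypothetical, chosen only for this example.

```python
import base64
import io

# Hypothetical key names for the five columns, in the order listed above.
FIELDS = ["uniq_id", "image_id", "caption", "object_labels", "image_b64"]


def read_rows(handle):
    """Yield one dict per line, assuming tab-separated fields (an assumption)."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != len(FIELDS):
            raise ValueError(
                f"line {line_no}: expected {len(FIELDS)} fields, got {len(parts)}"
            )
        record = dict(zip(FIELDS, parts))
        # Decode the base64 string back into raw image bytes.
        record["image_bytes"] = base64.b64decode(record.pop("image_b64"))
        yield record


# Build a tiny in-memory sample instead of reading a real file.
fake_image = b"not a real image, just placeholder bytes"
sample_line = "\t".join(
    [
        "0",
        "img_1",
        "a dog on a beach",
        "dog&&beach",  # placeholder label string; the labels are not used
        base64.b64encode(fake_image).decode("ascii"),
    ]
)
rows = list(read_rows(io.StringIO(sample_line + "\n")))
```

With a real file, the same function could be called as `read_rows(open(path, encoding="utf-8"))`. The decoded `image_bytes` can then be handed to an image library of your choice.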