We developed an AI model that quantifies similarity and dissimilarity between art pieces by using language as an intermediary, in order to address the limitations of comparing art purely visually. In essence, we aim to capture the contextual nuances of artworks in language descriptions. In our model workflow, paintings are processed through several layers and mapped onto a distinct ordering of predefined attributes; these attribute orderings are then used to compute dissimilarity scores between pairs of artworks with our similarity functions.
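To make the comparison step concrete, here is a minimal sketch of one way a rank-based dissimilarity over attribute orderings could look. The attribute names and the footrule-style distance are illustrative assumptions, not our model's exact similarity functions.

```python
# Minimal sketch of a rank-based dissimilarity between two attribute orderings.
# The attribute lists and the footrule-style distance are illustrative
# assumptions; our model's actual similarity functions may weight ranks differently.

def attribute_dissimilarity(order_a: list[str], order_b: list[str]) -> float:
    """Return a score in [0, 1]; 0 means identical attribute orderings."""
    assert sorted(order_a) == sorted(order_b), "orderings must cover the same attributes"
    n = len(order_a)
    rank_b = {attr: i for i, attr in enumerate(order_b)}
    # Sum of rank displacements (Spearman footrule), normalized by its maximum.
    displacement = sum(abs(i - rank_b[attr]) for i, attr in enumerate(order_a))
    max_displacement = (n * n) // 2  # largest possible footrule distance
    if max_displacement == 0:
        return 0.0
    return displacement / max_displacement

# Hypothetical attribute orderings produced for two paintings.
painting_1 = ["geometric", "muted palette", "fragmented forms", "portrait"]
painting_2 = ["portrait", "muted palette", "geometric", "fragmented forms"]
print(attribute_dissimilarity(painting_1, painting_2))  # 0.75
```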
We present evidence for the effectiveness and consistency of our similarity functions through domain-specific experiments grounded in art analysis.
We wanted to test our hypothesis that across one artist’s portfolio, pieces created closer together in time would be more similar than pieces created further apart in time. To test this, we took 20 Picasso paintings over a timespan of 40 years. We put them through our model and plotted the dissimilarity scores between each pair of images. We found that an increase in absolute time difference between paintings was correlated with a higher dissimilarity score, supporting our hypothesis.
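As a rough illustration of this analysis step, the sketch below pairs up paintings, computes each pair's year gap and dissimilarity, and measures their correlation. The `paintings` list is placeholder data, and `attribute_dissimilarity` refers to the illustrative function sketched earlier, not our production model.

```python
# Sketch of the Picasso experiment's analysis step: correlate the absolute
# year gap between each pair of paintings with their dissimilarity score.
from itertools import combinations
import numpy as np

# Placeholder data: (year, attribute ordering) per painting.
paintings = [
    (1907, ["geometric", "muted palette", "fragmented forms", "portrait"]),
    (1921, ["portrait", "muted palette", "geometric", "fragmented forms"]),
    (1947, ["fragmented forms", "portrait", "geometric", "muted palette"]),
]

year_gaps, scores = [], []
for (y1, a1), (y2, a2) in combinations(paintings, 2):
    year_gaps.append(abs(y1 - y2))
    scores.append(attribute_dissimilarity(a1, a2))  # illustrative function from above

# Pearson correlation between time gap and dissimilarity.
r = np.corrcoef(year_gaps, scores)[0, 1]
print(f"correlation between year gap and dissimilarity: {r:.2f}")
```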
Another hypothesis we tested was that paintings from a single artist's portfolio would have lower dissimilarity scores when compared to each other than when compared to art by a different artist. We assumed that an artist's work tends to be relatively uniform in style, and that our pretrained model would be able to detect those nuances and rank the comparisons accordingly. To test this, we ran a sample of 12 paintings by Raphael and Frida Kahlo through our dissimilarity function and plotted the scores. As expected, the average dissimilarity scores between paintings by the same artist were consistently lower than those between paintings by different artists. This pattern held for every pair of artists we tested from the WikiArt dataset.
We will now demonstrate a practical use case of the model: identifying the most similar artworks within a diverse set of 10 pieces and uncovering connections across time periods, cultures, and styles.
Across these 10 images, our model identified the following most similar pairs and the themes they share:
This project aims to explore the application of machine learning in the realm of art analysis by using advanced techniques to extract themes and nuances from images. The primary motivation behind this idea is to offer new insights into artistic content, thereby enhancing our understanding of art.
One of our key implementation ideas is to develop a novel approach that serves as an inverse to existing methods such as Stable Diffusion. By converting images into textual representations and then condensing them into coherent, synthetically derived themes using autoencoders, we seek to unveil latent artistic motifs that may not be immediately apparent in a given artwork, with natural language as the connecting fabric. This could help quantify similarities between artistic works across countries, cultures, and periods.
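As a rough sketch of the condensation step, the toy autoencoder below compresses a description embedding (for example, a 768-dimensional BERT vector) into a small latent "theme" vector and reconstructs it. The layer sizes and architecture are illustrative assumptions rather than our final design.

```python
# Minimal sketch of the theme-condensing idea: an autoencoder that compresses a
# description embedding into a small latent "theme" vector and reconstructs it.
import torch
import torch.nn as nn

class ThemeAutoencoder(nn.Module):
    def __init__(self, embed_dim: int = 768, theme_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, theme_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(theme_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        theme = self.encoder(x)           # condensed "theme" representation
        reconstruction = self.decoder(theme)
        return theme, reconstruction

# One training step: minimize reconstruction error of description embeddings.
model = ThemeAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randn(16, 768)              # stand-in for BERT description embeddings
theme, recon = model(batch)
loss = nn.functional.mse_loss(recon, batch)
loss.backward()
optimizer.step()
```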
Furthermore, we seek to enhance the interpretability of our neural network implementation and understand how the model derives meaning from images. By incorporating skip connections and analyzing the intermediate token representations in the hidden layers, we want to show how the embeddings evolve, in natural-language terms, through each subsequent hidden layer as the model extracts meaning from visual data. This also builds a level of transparency into our model.
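One way such probing could be implemented is sketched below: forward hooks capture each hidden layer's output so the evolving embeddings can be inspected afterwards (for example, by finding their nearest Word2Vec neighbors). It reuses the illustrative ThemeAutoencoder from the previous sketch; our actual network and probing method may differ.

```python
# Sketch of the interpretability idea: register forward hooks on each encoder
# layer of the illustrative ThemeAutoencoder defined above, then run one input
# through the model and inspect the captured intermediate representations.
import torch

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model = ThemeAutoencoder()
for name, layer in model.encoder.named_children():
    layer.register_forward_hook(save_activation(f"encoder.{name}"))

_ = model(torch.randn(1, 768))
for name, tensor in activations.items():
    print(name, tuple(tensor.shape))  # layer-by-layer embedding shapes
```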
Our project also has the potential to evolve into a research direction: we intend to investigate the performance disparity between image reconstruction using convolutional-layer embeddings and image reconstruction using textual representations as the intermediary medium, and to document which approach better captures the various kinds of artistic nuance. Additionally, we want to explore whether using images as an additional layer of reconstruction can improve the natural-language summarization/compression capability of our autoencoder, in contrast to training the autoencoder on textual-description reconstruction alone.
In conclusion, our project aims to create a tool that performs classification on images using natural language as the medium. By implementing interpretability and synergy between image and text encoding, we also intend to research autoencoder performance and token summarization performance for image-to-text-to-image and text-to-text reconstruction.
One challenge we encountered during the project was devising a method to compress a matrix composed of BERT vectors into a matrix composed of Word2Vec vectors while retaining as much information as possible. This task involved compressing a matrix of vectors with dimensions (n, 768) into a matrix with dimensions (n, 300) with minimal loss of information. Initially, we attempted Principal Component Analysis (PCA), a common technique for dimensionality reduction. However, we discovered that PCA only effectively compressed the data when n was greater than 768, since the number of meaningful principal components is bounded by the number of samples, whereas our dataset often involved smaller n values, typically less than or equal to 768. We then explored Singular Value Decomposition (SVD) as an alternative, but encountered similar limitations with small n.
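The constraint we ran into can be reproduced directly with scikit-learn, whose PCA refuses to extract more components than min(n_samples, n_features); the data below is random stand-in data, not our actual description embeddings.

```python
# Illustration of the limitation: PCA cannot yield more components than
# min(n_samples, n_features), so with few description vectors (small n) it
# cannot map (n, 768) onto (n, 300).
import numpy as np
from sklearn.decomposition import PCA

X_small = np.random.rand(50, 768)    # n = 50 BERT-sized vectors
try:
    PCA(n_components=300).fit(X_small)
except ValueError as err:
    print(err)                       # n_components must be <= min(n_samples, n_features)

X_large = np.random.rand(1000, 768)  # with enough samples the reduction works
reduced = PCA(n_components=300).fit_transform(X_large)
print(reduced.shape)                 # (1000, 300)
```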
After further research, we experimented with two other methods: random projection and averaging compression. Random projection, a straightforward technique that projects data onto a lower-dimensional subspace using a random matrix, proved robust for dimensionality reduction, especially with small sample sizes. However, quantifying how much information random projection preserves posed challenges, although this can be addressed using the Johnson-Lindenstrauss lemma. Additionally, we implemented a simpler averaging compression method, dividing the original 768-dimensional vectors into equal-length segments and computing the average of each segment to derive a lower-dimensional representation. However, neither of these methods is expected to preserve as much information as PCA or SVD.
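A minimal sketch of the two fallbacks follows, assuming scikit-learn's GaussianRandomProjection for the random projection and NumPy for the averaging compression; how to handle the fact that 768 is not evenly divisible by 300 is an illustrative choice here.

```python
# Sketch of the two fallback compressions: a Gaussian random projection and a
# simple averaging compression over roughly equal-length segments of each vector.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def average_compress(X: np.ndarray, out_dim: int = 300) -> np.ndarray:
    """Average roughly equal-length segments of each row down to out_dim values."""
    # np.array_split handles the case where the width is not divisible by out_dim.
    segments = np.array_split(X, out_dim, axis=1)
    return np.stack([seg.mean(axis=1) for seg in segments], axis=1)

X = np.random.rand(20, 768)                      # small n, BERT-sized vectors

projector = GaussianRandomProjection(n_components=300)
X_rp = projector.fit_transform(X)                # works even when n < 300
X_avg = average_compress(X, out_dim=300)

print(X_rp.shape, X_avg.shape)                   # (20, 300) (20, 300)
```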
In the early stages of the project, we aimed to develop comprehensive solutions from scratch, intending to extract descriptions from images and identify the most significant keywords and themes from these descriptions. However, despite investing substantial time and effort into training our own models and seeking guidance from a UMass ML professor, we encountered challenges in achieving meaningful semantic results. Recognizing the limitations imposed by our timeframe, we concluded that leveraging pretrained models was essential for efficient progress. Consequently, we opted to utilize existing pretrained models, such as Word2Vec and BERT, and integrated the OpenAI API for obtaining image descriptions. While we may explore further fine-tuning or even training our own models in the future, we prioritized using these established methods due to their demonstrated effectiveness in delivering promising results, as evidenced in our project outcomes.
Furthermore, in our effort to compress our data using Word2Vec or other predefined embeddings, we encountered another obstacle. While attempting to compress labels conveying general semantics, we discovered that Word2Vec embeddings lack contextual information from surrounding tokens, which limits their effectiveness. This made us appreciate one of the primary motivations behind attention mechanisms: transferring context among tokens in natural language processing tasks. It also prompted us to explore alternative approaches for data compression, emphasizing the importance of context-aware embeddings for preserving semantic meaning in our model's representations; we even tried to create our own attention-based mechanism from scratch, sketched below.
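For reference, a minimal single-head version of scaled dot-product attention is shown here; the shapes and the NumPy formulation are simplifications of what a full implementation would use.

```python
# Minimal single-head scaled dot-product attention: each token's representation
# is updated as a softmax-weighted mixture of all tokens, transferring context.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays; returns contextualized token vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # context-weighted mixture

tokens = np.random.rand(5, 64)   # five token embeddings of dimension 64
contextualized = scaled_dot_product_attention(tokens, tokens, tokens)
print(contextualized.shape)      # (5, 64)
```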
In developing our pretrained-model pipeline, we also encountered challenges when using prompt engineering to extract predefined attributes for image comparison. Despite our efforts to prompt GPT-4 to adhere strictly to the specified attributes, the model frequently injected or substituted attributes not included in our predefined lists. Despite multiple attempts to guide GPT-4 through explicit prompts, we found that language prompting alone was insufficient to enforce adherence to our attribute parameters. To address this issue, we devised a validation function that filters out any attributes not included in our predefined list and replaces missing attributes with placeholders at the end of the list. This effectively prioritizes the predefined attributes GPT-4 did use, since attribute order is significant for our comparison tasks. While this solution lets us work with our predefined attributes, we recognize that alternative prompting strategies or further model tuning could keep GPT-4 within the specified parameter constraints, an area that requires further exploration and refinement on our end.
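A sketch of such a validation function is shown below; the attribute names are hypothetical, and treating the omitted predefined attributes themselves as the end-of-list placeholders is an illustrative choice.

```python
# Sketch of the validation step: keep only attributes from the predefined list
# (in the order GPT-4 ranked them) and pad anything the model omitted onto the
# end so it carries the lowest priority.
def validate_attributes(model_output: list[str], allowed: list[str]) -> list[str]:
    # Keep the model's ranking, but drop anything outside the predefined list.
    kept = [attr for attr in model_output if attr in allowed]
    # Append any predefined attributes the model skipped, preserving list order,
    # so they act as end-of-list placeholders.
    missing = [attr for attr in allowed if attr not in kept]
    return kept + missing

allowed = ["color palette", "brushwork", "composition", "subject matter"]
raw = ["brushwork", "emotional impact", "color palette"]   # "emotional impact" is off-list
print(validate_attributes(raw, allowed))
# ['brushwork', 'color palette', 'composition', 'subject matter']
```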
Our final experiment was to build our own CNN to classify images by style. Despite training the model on 5,000 paintings, tweaking the architecture, and tuning the hyperparameters, we were not able to achieve consistently decreasing loss. Although this may reflect shortcomings of our CNN, we think it underscores the value of interpreting art through a language intermediary, as our model does.