
% This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
\pdfoutput=1
% In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.

\documentclass[11pt]{article}

% Remove the "review" option to generate the final version.
\usepackage{EMNLP2023}

% Standard package includes
\usepackage{times}
\usepackage{latexsym}
\usepackage{float}

% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% This is not strictly necessary, and may be commented out.
% However, it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}
% includegraphics
\usepackage{graphicx}
%listings
\usepackage{listings}
\usepackage{dirtytalk}

%review macros
\usepackage{xcolor}
\newcommand{\review}[1]{{\color{black}#1}}

%pandas tables
\usepackage{booktabs}
% Commands
\newcommand{\todo}[1]{{\color{red}\colorbox{yellow}{\textbf{TODO: }}#1}}
\newcommand{\averitec}{AVeriTeC}
\newcommand{\supp}{Supported}
\newcommand{\reff}{Refuted}
\newcommand{\nei}{Not enough evidence}
\newcommand{\conf}{Conflicting evidence/Cherrypicking}
\makeatletter
\newcommand\footnoteref[1]{\protected@xdef\@thefnmark{\ref{#1}}\@footnotemark}
\makeatother

% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.

\title{AIC CTU system at \averitec{}: Re-framing automated fact-checking as a simple RAG task}

% Author information can be set in various styles:
% For several authors from the same institution:
% \author{Author 1 \and ... \and Author n \\
%         Address line \\ ... \\ Address line}
% if the names do not fit well on one line use
%         Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \And  ... \And
%         Author n \\ Address line \\ ... \\ Address line}
% To start a separate ``row'' of authors use \AND, as in
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \AND
%         Author 2 \\ Address line \\ ... \\ Address line \And
%         Author 3 \\ Address line \\ ... \\ Address line}

\author{Herbert Ullrich \\
AI Center @ CTU FEE\\
Charles Square 13\\
Prague, Czech Republic\\
\texttt{ullriher@fel.cvut.cz} \\\And
Tomáš Mlynář \\
AI Center @ CTU FEE\\
Charles Square 13\\
Prague, Czech Republic\\
\texttt{mlynatom@fel.cvut.cz} \\ \\\And
Jan Drchal \\
AI Center @ CTU FEE\\
Charles Square 13\\
Prague, Czech Republic\\
\texttt{drchajan@fel.cvut.cz} \\}

\begin{document}
\maketitle
\begin{abstract}
This paper describes our third-place submission to the \averitec{} shared task, in which we address the challenge of fact-checking with evidence retrieved in the wild using a simple Retrieval-Augmented Generation (RAG) scheme designed for the task, leveraging the predictive power of Large Language Models.
We release our codebase\footnote{\url{https://github.com/aic-factcheck/aic_averitec}}, and explain its two modules -- the Retriever and the Evidence \& Label generator -- in detail, justifying their features such as MMR-reranking and Likert-scale confidence estimation.
We evaluate our solution on the \averitec{} dev and test sets and interpret the results, picking GPT-4o as the most appropriate model for our pipeline at the time of publication, with Llama 3.1 70B being a promising open-source alternative.
Our empirical error analysis shows that faults in our predictions often \review{coincide} with noise in the data or ambiguous fact-checks, motivating further research and data augmentation.

\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% inputs
\input{src/introduction}
\input{src/system_description}
\input{src/classification}
\input{src/results}
%\input{src/software.tex}
\input{src/conclusions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section*{Limitations}
The evaluation of our fact-checking pipeline is limited to the English language and the \averitec{} dataset~\cite{averitec2024}. This is a severe limitation, as the pipeline, when deployed in a real-world application, would encounter other languages and forms of claims not covered by the dataset.

Another limitation is our reliance on a large language model. Future use therefore requires either the API of an LLM provider or access to a large amount of computational resources, which comes at significant cost. Using APIs also brings the disadvantage of sending data to a third party, which might be a security risk in some critical applications. LLM usage also has an undeniable environmental impact because of the vast amount of electricity and resources it consumes.

The reliability of the generated text is a limitation often linked to LLMs. LLMs sometimes hallucinate (in our case, this would mean using sources other than those given in the system prompt), and they can reflect biases present in their extensive training data. Moreover, because of the dataset size, it is impossible to validate each output of the LLM, and thus we cannot fully guarantee the quality of the results.

\section*{Ethics statement}
It is essential to note that our pipeline is not a real fact-checker that could do the job of a human, but rather a study of future possibilities in automatic fact-checking and a showcase of the current capabilities of state-of-the-art language models. The pipeline in its current state should only be used with human supervision because of the potential biases and errors that could harm the consumers of the output information or persons mentioned in the claims. The pipeline could be misused to spread misinformation by directly using misinformation sources or by intentionally modifying the pipeline so that it generates wrong outputs.

It is also important to note that our pipeline, in its current form, was built explicitly for the \averitec{} shared task, and thus the evaluation results reflect the bias of the annotators. For more information, see the relevant section of the original paper~\cite{averitec2024}.

The carbon costs of training and running our pipeline are considerable and should be taken into account given the urgency of climate change. At the time of deployment, the pipeline should be run on the smallest possible model that can still provide reliable results, and the latest hardware and software optimisations should be used to minimise the carbon footprint.

\section*{Acknowledgements}
We would like to thank Bryce Aaron from UNC for exploring the problems of search query generation and pinpointing claims of underrepresented labels using numerical methods that did not make it into our final pipeline but gave us a frame for comparison. 

This research was co-financed with state support from the Technology Agency of the Czech Republic and the Ministry of Industry and Trade of the Czech Republic under the TREND Programme, project FW10010200.
The access to the computational infrastructure of the OP VVV funded project CZ.02.1.01/0.0/0.0/16\_019/0000765 ``Research Center for Informatics'' is also gratefully acknowledged.
We would like to thank \mbox{OpenAI} for providing free credit for their paid API via the Researcher Access Program\footnote{\review{\url{https://openai.com/form/researcher-access-program/}}}.


% Entries for the entire Anthology, followed by custom entries
\bibliography{anthology,custom}
\bibliographystyle{acl_natbib}

\appendix

\include{src/appendix_a_llms}
%\include{src/appendix_b_opensource}
\include{src/appendix_c_errors}

\end{document}