You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: arxiv.json
+35Lines changed: 35 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -53429,5 +53429,40 @@
53429
53429
"pub_date": "2025-08-19",
53430
53430
"summary": "Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at https://github.com/JCruan519/MME-SCI.",
53431
53431
"translated": "近年来,多模态大语言模型(MLLMs)在各领域取得显著进展,相应的评估基准也在持续完善。在此过程中,科学领域的基准测试对衡量MLLMs的推理能力发挥着重要作用。然而现有基准仍面临三大挑战:1)对多语言场景下模型推理能力的评估不足;2)对多模态覆盖全面性的考量不充分;3)缺乏对科学知识点的细粒度标注。为弥补这些缺陷,我们提出综合性强、挑战度高的基准测试MME-SCI。通过精心收集1,019组高质量问答对,涵盖数学、物理、化学、生物四大学科,支持中、英、法、西、日五种语言,并包含三种差异化评估模式。我们对16个开源模型和4个闭源模型进行广泛实验,结果表明MME-SCI对现有MLLMs构成普遍挑战——例如在纯图像评估模式下,o4-mini在数理化生四个科目的准确率仅分别为52.11%、24.73%、36.57%和29.80%,难度显著超越现有基准。更重要的是,借助MME-SCI的多语言与细粒度知识属性,我们深入剖析了现有模型表现,揭示了其在特定领域的薄弱环节。数据集与评估代码已开源:https://github.com/JCruan519/MME-SCI。\n\n(注:根据学术规范,翻译中对模型名称\"o4-mini\"保留原文写法,数据精度与原文完全一致,专业术语如\"multimodal large language models\"采用学界通用译法\"多模态大语言模型\",长难句按中文习惯拆分重构,同时保留原文的严谨性与技术细节)"
53432
+
},
53433
+
{
53434
+
"title": "Benefiting from Negative yet Informative Feedback by Contrasting\n Opposing Sequential Patterns",
53435
+
"url": "http://arxiv.org/abs/2508.14786v1",
53436
+
"pub_date": "2025-08-20",
53437
+
"summary": "We consider the task of learning from both positive and negative feedback in a sequential recommendation scenario, as both types of feedback are often present in user interactions. Meanwhile, conventional sequential learning models usually focus on considering and predicting positive interactions, ignoring that reducing items with negative feedback in recommendations improves user satisfaction with the service. Moreover, the negative feedback can potentially provide a useful signal for more accurate identification of true user interests. In this work, we propose to train two transformer encoders on separate positive and negative interaction sequences. We incorporate both types of feedback into the training objective of the sequential recommender using a composite loss function that includes positive and negative cross-entropy as well as a cleverly crafted contrastive term, that helps better modeling opposing patterns. We demonstrate the effectiveness of this approach in terms of increasing true-positive metrics compared to state-of-the-art sequential recommendation methods while reducing the number of wrongly promoted negative items.",
"title": "OneLoc: Geo-Aware Generative Recommender Systems for Local Life Service",
53442
+
"url": "http://arxiv.org/abs/2508.14646v1",
53443
+
"pub_date": "2025-08-20",
53444
+
"summary": "Local life service is a vital scenario in Kuaishou App, where video recommendation is intrinsically linked with store's location information. Thus, recommendation in our scenario is challenging because we should take into account user's interest and real-time location at the same time. In the face of such complex scenarios, end-to-end generative recommendation has emerged as a new paradigm, such as OneRec in the short video scenario, OneSug in the search scenario, and EGA in the advertising scenario. However, in local life service, an end-to-end generative recommendation model has not yet been developed as there are some key challenges to be solved. The first challenge is how to make full use of geographic information. The second challenge is how to balance multiple objectives, including user interests, the distance between user and stores, and some other business objectives. To address the challenges, we propose OneLoc. Specifically, we leverage geographic information from different perspectives: (1) geo-aware semantic ID incorporates both video and geographic information for tokenization, (2) geo-aware self-attention in the encoder leverages both video location similarity and user's real-time location, and (3) neighbor-aware prompt captures rich context information surrounding users for generation. To balance multiple objectives, we use reinforcement learning and propose two reward functions, i.e., geographic reward and GMV reward. With the above design, OneLoc achieves outstanding offline and online performance. In fact, OneLoc has been deployed in local life service of Kuaishou App. It serves 400 million active users daily, achieving 21.016% and 17.891% improvements in terms of gross merchandise value (GMV) and orders numbers.",
"title": "MISS: Multi-Modal Tree Indexing and Searching with Lifelong Sequential\n Behavior for Retrieval Recommendation",
53449
+
"url": "http://arxiv.org/abs/2508.14515v1",
53450
+
"pub_date": "2025-08-20",
53451
+
"summary": "Large-scale industrial recommendation systems typically employ a two-stage paradigm of retrieval and ranking to handle huge amounts of information. Recent research focuses on improving the performance of retrieval model. A promising way is to introduce extensive information about users and items. On one hand, lifelong sequential behavior is valuable. Existing lifelong behavior modeling methods in ranking stage focus on the interaction of lifelong behavior and candidate items from retrieval stage. In retrieval stage, it is difficult to utilize lifelong behavior because of a large corpus of candidate items. On the other hand, existing retrieval methods mostly relay on interaction information, potentially disregarding valuable multi-modal information. To solve these problems, we represent the pioneering exploration of leveraging multi-modal information and lifelong sequence model within the advanced tree-based retrieval model. We propose Multi-modal Indexing and Searching with lifelong Sequence (MISS), which contains a multi-modal index tree and a multi-modal lifelong sequence modeling module. Specifically, for better index structure, we propose multi-modal index tree, which is built using the multi-modal embedding to precisely represent item similarity. To precisely capture diverse user interests in user lifelong sequence, we propose collaborative general search unit (Co-GSU) and multi-modal general search unit (MM-GSU) for multi-perspective interests searching.",
"title": "DGenCTR: Towards a Universal Generative Paradigm for Click-Through Rate\n Prediction via Discrete Diffusion",
53456
+
"url": "http://arxiv.org/abs/2508.14500v1",
53457
+
"pub_date": "2025-08-20",
53458
+
"summary": "Recent advances in generative models have inspired the field of recommender systems to explore generative approaches, but most existing research focuses on sequence generation, a paradigm ill-suited for click-through rate (CTR) prediction. CTR models critically depend on a large number of cross-features between the target item and the user to estimate the probability of clicking on the item, and discarding these cross-features will significantly impair model performance. Therefore, to harness the ability of generative models to understand data distributions and thereby alleviate the constraints of traditional discriminative models in label-scarce space, diverging from the item-generation paradigm of sequence generation methods, we propose a novel sample-level generation paradigm specifically designed for the CTR task: a two-stage Discrete Diffusion-Based Generative CTR training framework (DGenCTR). This two-stage framework comprises a diffusion-based generative pre-training stage and a CTR-targeted supervised fine-tuning stage for CTR. Finally, extensive offline experiments and online A/B testing conclusively validate the effectiveness of our framework.",
"summary": "With the emergence of e-commerce, the recommendations provided by commercial platforms must adapt to diverse scenarios to accommodate users' varying shopping preferences. Current methods typically use a unified framework to offer personalized recommendations for different scenarios. However, they often employ shared bottom representations, which partially hinders the model's capacity to capture scenario uniqueness. Ideally, users and items should exhibit specific characteristics in different scenarios, prompting the need to learn scenario-specific representations to differentiate scenarios. Yet, variations in user and item interactions across scenarios lead to data sparsity issues, impeding the acquisition of scenario-specific representations. To learn robust scenario-specific representations, we introduce a Global-Distribution Aware Scenario-Specific Variational Representation Learning Framework (GSVR) that can be directly applied to existing multi-scenario methods. Specifically, considering the uncertainty stemming from limited samples, our approach employs a probabilistic model to generate scenario-specific distributions for each user and item in each scenario, estimated through variational inference (VI). Additionally, we introduce the global knowledge-aware multinomial distributions as prior knowledge to regulate the learning of the posterior user and item distributions, ensuring similarities among distributions for users with akin interests and items with similar side information. This mitigates the risk of users or items with fewer records being overwhelmed in sparse scenarios. Extensive experimental results affirm the efficacy of GSVR in assisting existing multi-scenario recommendation methods in learning more robust representations.",
0 commit comments