Stars
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
An Open-Source Python3 tool for recognizing layouts, tables, math formulas (LaTeX), and text in images, converting them into Markdown format. A free alternative to Mathpix, empowering seamless conv…
Faker is a Python package that generates fake data for you.
similarity: Text similarity calculation Toolkit for Java. 文本相似度计算工具包,java编写,可用于文本相似度计算、情感分析等任务,开箱即用。
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
中华经典古籍精校、诗词,四书五经、四大名著、诗经、楚辞、全唐诗、全宋词、唐诗三百首、宋詞三百首、二十四史......
Similarities: a toolkit for similarity calculation and semantic search. 相似度计算、匹配搜索工具包,支持亿级数据文搜文、文搜图、图搜图,python3开发,开箱即用。
Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model, training data, evaluation data, evaluation me…
Democratizing Internet-scale financial data.
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
一款高性能敏感词(非法词/脏字)检测过滤组件,附带繁体简体互换,支持全角半角互换,汉字转拼音,模糊搜索等功能。
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
首个llama2 13b 中文版模型 (Base + 中文对话SFT,实现流畅多轮人机自然语言交互)
Tuning LLMs with no tears💦; Sample Design Engineering (SDE) for more efficient downstream-tuning.
Train transformer language models with reinforcement learning.
Firefly中文LLaMA-2大模型,支持增量预训练Baichuan2、Llama2、Llama、Falcon、Qwen、Baichuan、InternLM、Bloom等大模型
中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型 (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
Llama中文社区,Llama3在线体验和微调模型已开放,实时汇总最新Llama3学习资料,已将所有代码更新适配Llama3,构建最好的中文Llama大模型,完全开源可商用
Code and data of the CCL 2022 paper "Automatic Construction of Sentence Pattern Structure Treebank".
比较全的中华古诗古词古文库,包括21万首古诗词,以及注释、赏析等信息,包含10000多名诗人以及诗人的介绍、生平等,同时包含,1600多个词牌介绍,中国70多个朝代解析,和古诗文的近200个分类标签