C++ split Chinese string #27

Open
@Shellbye

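Compared with Python, which is quick and convenient, C++ offers far fewer ready-made common operations. A recent project required me to split a Chinese string into individual characters. As a C++ newcomer I searched online for a long time without finding anything, and only by luck came across this Stack Overflow Q&A, which finally pointed me in the right direction.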

#include <iostream>
#include <string>
#include <vector>

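// Splits a UTF-8 encoded string into one std::string per code point.
std::vector<std::string> split_chinese(const std::string& s) {
    std::vector<std::string> t;
    for (size_t i = 0; i < s.length();)
    {
        // Byte length of the code point starting at s[i]; 1 covers ASCII.
        size_t cplen = 1;
        // Read the lead byte as unsigned so the bit tests do not depend on
        // whether plain char is signed on this platform.
        const unsigned char c = static_cast<unsigned char>(s[i]);
        // For the lead-byte patterns tested below, see
        // https://en.wikipedia.org/wiki/UTF-8#Description
        if ((c & 0xf8) == 0xf0)      // 11110xxx: 4-byte sequence
            cplen = 4;
        else if ((c & 0xf0) == 0xe0) // 1110xxxx: 3-byte sequence
            cplen = 3;
        else if ((c & 0xe0) == 0xc0) // 110xxxxx: 2-byte sequence
            cplen = 2;
        // A sequence truncated by the end of the string falls back to one byte.
        if ((i + cplen) > s.length())
            cplen = 1;
        t.push_back(s.substr(i, cplen));
        i += cplen;
    }
    return t;
}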

int main()
{
    // The source file must be saved as UTF-8 for this literal to hold UTF-8 bytes.
    std::string s = "这是一组中文";
    std::vector<std::string> t = split_chinese(s);
    // Print one character per line.
    for (const auto& a : t) {
        std::cout << a << '\n';
    }
    return 0;
}
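
The function above trusts the lead byte and only checks that the sequence fits in the string. If the input may contain malformed UTF-8, a stricter variant could also verify that every following byte is a continuation byte (`10xxxxxx`). The following is only a sketch of that idea, not part of the Stack Overflow answer; the names `utf8_len` and `split_utf8_checked` are made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: sequence length implied by a UTF-8 lead byte.
// Returns 1 for ASCII, stray continuation bytes, and invalid lead bytes.
std::size_t utf8_len(unsigned char lead) {
    if ((lead & 0xf8) == 0xf0) return 4; // 11110xxx
    if ((lead & 0xf0) == 0xe0) return 3; // 1110xxxx
    if ((lead & 0xe0) == 0xc0) return 2; // 110xxxxx
    return 1;
}

// Hypothetical stricter split: a multi-byte sequence is accepted only if it
// fits in the string and all trailing bytes look like 10xxxxxx. Otherwise
// the lead byte is emitted on its own and scanning resumes at the next byte.
std::vector<std::string> split_utf8_checked(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t len = utf8_len(static_cast<unsigned char>(s[i]));
        if (i + len > s.size())
            len = 1; // truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xc0) != 0x80) {
                len = 1; // not a continuation byte
                break;
            }
        }
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}
```

With well-formed input this behaves like `split_chinese`. The difference shows up on broken input: a lone `0xE4` followed by ASCII text is emitted as a single byte instead of swallowing the next two characters. This sketch does not reject overlong encodings or surrogate code points.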
