Implementing a Tokenizer in Unity Sentis - CLIP Byte Pair Encoding

pnltoen 2025. 5. 4.

Byte Pair Encoding

Unity Technology

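Introduction

In the previous post I looked at how MiniLM tokenizes text. It turned out not to be true WordPiece. Because the meaning of a word mostly sits on its left side, the implementation mimics WordPiece by trimming the candidate one character at a time from the right while searching the vocab.

That left me wondering what to actually do with tokenized text, so the goal this time is to run a zero-shot model in Unity Sentis. The official Unity Sentis samples include neither a zero-shot example nor a BPE sample, so the BPE implementation alone may already be worth something.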

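Main Text

The hardest part of the implementation was, in fact, finding out how BPE is actually implemented. My original plan was to port a Python BPE implementation to Unity, that is, to C#. But every resource I found on Google only said that BPE tokenization can be done in PyTorch by calling the encode function.

What others do in a single line of code I had to build myself, so the work was both difficult and poorly documented. This post is my attempt to write it all down.

The post was written after the work was finished, but during implementation there was so little material on how the encode function works internally that it took a lot of time. I also considered the following alternatives, which I share here for reference.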

Using ONNX Runtime Extensions
- Ruled out, because using ONNX Runtime in Unity comes with limits on GPU usage
- Reference: microsoft/onnxruntime-extensions, a specialized pre- and post-processing library for ONNX Runtime (github.com)
Using Unity Python Scripting
- Not used, because I judged it would make the code harder to manage
Using a Python server
- Not used, because it seemed to undercut the on-device advantage of Unity Sentis

There were CLIP repos in two places, GitHub and Hugging Face, and the GitHub one appears to be the more outdated of the two. ONNX export of CLIP itself and import into Unity Sentis did not go well with it, so I proceeded with the Hugging Face version.

GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image (github.com)

The ONNX conversion trouble aside, I needed the Vit-base-patch32 model in the end anyway, so using the Hugging Face model was no problem at all.

Vit-base-patch32

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

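Here CLIPProcessor is imported from the transformers library, and that CLIPProcessor is what performs the tokenization. Looking at the code alone, this part feels quite opaque, and in practice I missed a lot of steps along the way.

So let me lay out the rules the implementation needs.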

How Byte Pair Encoding Works

Hugging Face and YouTube already have plenty of examples and material on BPE itself, so I will only describe it briefly.

Byte-Pair Encoding tokenization - Hugging Face LLM Course (huggingface.co)

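Load the vocab and merge rules
- vocab.json: mapping between token strings and token IDs
- merges.txt: merge priority information
Split the input sentence into words
- Split on whitespace or delimiters
- Append </w> to each word to mark the word boundary for BPE
Check whether each word exists in the vocab
- If it does, use it as is
- If it does not, split the word into characters → apply BPE merges
Repeat pair merges with the BPE algorithm
- Merge the highest-priority pair first, based on the merge rules
- Handle the </w> marker on the last token where needed
Insert special tokens
- Start of sentence: <|startoftext|>
- End of sentence: <|endoftext|>
Apply dynamic padding
- Pad to the longest token length among the input sentences
- Fill the missing length with <|endoftext|>
Create the attention mask (optional)
- Real token: 1, padding: 0
- Create it only if the model requires it as an input
Apply L2 normalization (optional)
- Depending on the export format of some CLIP models, the embedding output needs L2 normalization
- Example: embedding / ||embedding||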

Now let me go through the code step by step. Rather than covering everything, I will focus on the parts that actually matter.

Loading the Vocab and Merge Rules

    void LoadPretrainedData()
    {
        // vocab.json: token string -> token ID
        string vocabPath = Application.streamingAssetsPath + "/BPE_vocab.json";
        if (!File.Exists(vocabPath)) return;
        string jsonText = File.ReadAllText(vocabPath);
        _vocab = JsonConvert.DeserializeObject<Dictionary<string, int>>(jsonText);

        // merges.txt: one merge rule per line, highest priority first
        string mergesPath = Application.streamingAssetsPath + "/BPE_merges.txt";
        if (!File.Exists(mergesPath)) return;
        string[] lines = File.ReadAllLines(mergesPath);
        int index = 0;
        foreach (string line in lines)
        {
            // Skip the "#version" header that Hugging Face merges files start with,
            // so it is not registered as a merge rule.
            if (line.StartsWith("#version")) continue;
            string[] parts = line.Split(' ');
            if (parts.Length >= 2)
                _mergeRules[(parts[0], parts[1])] = index++;
        }
    }

This loads the JSON and text files. Nothing else needs to be stored for either of them.

The merge rules, however, carry a merge priority, so index and index++ are used to assign each rule its priority, keyed by the (left, right) tuple.
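To make the rank idea concrete, here is a minimal Python sketch of the same loading step. The function name load_merge_ranks and the sample lines are hypothetical, and it assumes the file starts with a "#version" header the way Hugging Face merges files usually do.

```python
def load_merge_ranks(lines):
    """Map each (left, right) pair to its rank; a lower rank means a higher merge priority."""
    ranks = {}
    for line in lines:
        # Skip blank lines and the "#version" header line (assumed file layout).
        if not line.strip() or line.startswith("#version"):
            continue
        parts = line.split()
        if len(parts) == 2:
            # The rank is simply the position of the rule in the file.
            ranks[(parts[0], parts[1])] = len(ranks)
    return ranks


# Hypothetical excerpt; a real merges file has tens of thousands of lines.
sample = ["#version: 0.2", "i n", "t h", "th e</w>"]
ranks = load_merge_ranks(sample)
print(ranks[("i", "n")], ranks[("th", "e</w>")])  # prints: 0 2
```

Because the rank is just the line position, no separate priority value has to be stored in the file.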

Splitting Sentences and Applying the BPE Algorithm

    public (List<List<int>>, int, int) TokenizeInput(string input)
    {
        _inputPhrases = input.Split(',');
        List<List<string>> phraseTokens = new List<List<string>>();

        foreach (string phrase in _inputPhrases)
        {
            string cleaned = phrase.Trim().ToLower().Replace("-", " ").Replace("_", " ").Replace(" ", "</w> ");
            string[] words = cleaned.Split(' ');
            if (words.Length > 0 && !words[words.Length - 1].EndsWith("</w>"))
                words[words.Length - 1] += "</w>";

            List<string> tokens = new List<string>();
            foreach (string word in words)
                tokens.AddRange(TokenizeWord(word));

            tokens.Insert(0, "<|startoftext|>");
            tokens.Add("<|endoftext|>");
            phraseTokens.Add(tokens);
        }

        _maxTokenLength = phraseTokens.Max(list => list.Count);
        _numPhrases = phraseTokens.Count;

        foreach (var tokenList in phraseTokens)
        {
            int padCount = _maxTokenLength - tokenList.Count;
            for (int j = 0; j < padCount; j++)
                tokenList.Add("<|endoftext|>");
        }

        _tokenIdLists = phraseTokens.Select(list => ConvertToTokenIds(list)).ToList();
        return (_tokenIdLists, _numPhrases, _maxTokenLength);
    }

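TokenizeInput handles the pre- and post-processing around tokenization.

EndsWith is used to check words[words.Length - 1], the last word, and to append the end-of-word marker </w> (the token that stands in for the trailing space) when it is missing.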
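The dynamic padding at the end of TokenizeInput can be sketched in Python as follows. This is only an illustration: pad_to_longest is a hypothetical helper, 49406 and 49407 are assumed to be the IDs of <|startoftext|> and <|endoftext|>, and the other IDs are made up.

```python
def pad_to_longest(sequences, pad_id):
    """Pad every sequence on the right until all have the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


EOT = 49407  # assumed ID of <|endoftext|>, which also serves as the padding token
batch = [
    [49406, 10, 11, 49407],          # shorter phrase (made-up word IDs)
    [49406, 10, 11, 12, 13, 49407],  # longest phrase in the batch
]
padded = pad_to_longest(batch, EOT)
print(padded[0])  # [49406, 10, 11, 49407, 49407, 49407]
```

Padding to the longest phrase in the batch, rather than to a fixed length, keeps the input tensor as small as the current set of phrases allows.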

The actual tokenization happens in TokenizeWord, as shown below.

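    List<string> TokenizeWord(string word)
    {
        // A word that is already a vocab entry (including its </w> marker) is used as is.
        if (_vocab.ContainsKey(word)) return new List<string> { word };

        bool endsWithW = word.EndsWith("</w>");
        string baseWord = endsWithW ? word.Substring(0, word.Length - 4) : word;

        List<string> splitChars = baseWord.Select(c => c.ToString()).ToList();
        // The merge rules were learned with </w> attached to the last character,
        // so attach it before merging; otherwise rules such as "th e</w>" can never match.
        if (endsWithW && splitChars.Count > 0)
            splitChars[splitChars.Count - 1] += "</w>";
        return ApplyBPE(splitChars);
    }

If the whole word is in the vocab it is used directly. Otherwise the word is split into characters, </w> is attached to the last character, and the BPE algorithm is applied.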

    public List<string> ApplyBPE(List<string> tokens)
    {
        while (tokens.Count > 1)
        {
            // Collect every adjacent pair that has a merge rule, together with its rank.
            Dictionary<(string, string), int> candidates = new Dictionary<(string, string), int>();
            for (int i = 0; i < tokens.Count - 1; i++)
            {
                var pair = (tokens[i], tokens[i + 1]);
                if (_mergeRules.TryGetValue(pair, out int rank))
                    candidates[pair] = rank;
            }
            // No applicable rule left: the word is fully merged.
            if (candidates.Count == 0) break;

            // The lowest rank is the highest merge priority.
            var best = candidates.OrderBy(p => p.Value).First().Key;
            string merged = best.Item1 + best.Item2;
            List<string> result = new List<string>();

            // Merge every occurrence of the best pair in a single pass.
            for (int i = 0; i < tokens.Count; i++)
            {
                if (i < tokens.Count - 1 && tokens[i] == best.Item1 && tokens[i + 1] == best.Item2)
                {
                    result.Add(merged);
                    i++; // skip the right half of the pair that was just merged
                }
                else
                {
                    result.Add(tokens[i]);
                }
            }
            // Safety check: stop if this pass changed nothing.
            if (tokens.SequenceEqual(result)) break;
            tokens = result;
        }
        return tokens;
    }

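A dictionary keyed by tuples stores each adjacent pair together with its priority.

The candidates are then sorted in ascending order, and the pair with the lowest rank, that is, the highest merge priority, is merged. The same pair can occur more than once in a word, so the loop merges every occurrence and uses i++ to skip the second half of a pair it has just merged.

Finally, SequenceEqual compares the input with result. If they are the same, meaning there is nothing left to merge, the loop ends.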
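The merge loop is easier to follow on a toy example. The Python sketch below mirrors the same idea under stated assumptions: apply_bpe and toy_ranks are hypothetical, the three rules are invented for illustration, and </w> is attached to the last character before merging, which is the convention the reference CLIP tokenizer uses.

```python
def apply_bpe(symbols, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until no rule applies."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [pair for pair in pairs if pair in ranks]
        if not candidates:
            break  # nothing left to merge
        best = min(candidates, key=lambda pair: ranks[pair])
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2  # consume both halves of the merged pair
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Invented rules: a lower number means a higher priority.
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r</w>"): 2}
print(apply_bpe(["l", "o", "w", "e", "r</w>"], toy_ranks))  # ['low', 'er</w>']
```

The word goes through three passes here: l+o, then lo+w, then e+r</w>. The loop stops because the remaining pair (low, er</w>) has no rule.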

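Attention Mask Creation and Post-Processing

In the end I used BPE for CLIP. Depending on the model, an attention mask may be required, or some post-processing (softmax, L2 norm, and so on) may have to be done separately.

I tried all of these, and the conclusion is simple: bring the model over as close to the original as you can. I tried to build an example that applies the L2 norm separately, the results came out different, and it cost me a lot of effort.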
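For reference, this is what the L2 normalization step (embedding / ||embedding||) amounts to, as a small Python sketch. The helper names and the two-dimensional vectors are made up for illustration; real CLIP embeddings have far more dimensions. Whether the step has to be applied outside the model depends on how the model was exported.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


text_embedding = l2_normalize([3.0, 4.0])   # made-up 2-D "embedding"
image_embedding = l2_normalize([4.0, 3.0])
print(text_embedding)  # [0.6, 0.8]
# For unit vectors the dot product is the cosine similarity.
print(dot(text_embedding, image_embedding))
```

If the exported model already normalizes its embeddings internally, doing it a second time is harmless, but leaving it out when the model expects it changes the similarity scores.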

The code above keeps a nested list throughout so that results are easy to compare, and CLIP also takes the attention mask as an input, so both the flattening and the attention mask have to be handled.
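A minimal Python sketch of those two steps, assuming the rows are already padded and the real (unpadded) length of each row is known. flatten_with_mask is a hypothetical helper and the token IDs are illustrative.

```python
def flatten_with_mask(padded_rows, real_lengths):
    """Flatten padded rows in row-major order and build a matching 1/0 attention mask."""
    flat_ids = [token for row in padded_rows for token in row]
    flat_mask = [
        1 if col < length else 0
        for row, length in zip(padded_rows, real_lengths)
        for col in range(len(row))
    ]
    return flat_ids, flat_mask


rows = [
    [49406, 10, 49407, 49407],  # 3 real tokens + 1 padding token
    [49406, 10, 11, 49407],     # 4 real tokens
]
ids, mask = flatten_with_mask(rows, [3, 4])
print(mask)  # [1, 1, 1, 0, 1, 1, 1, 1]
```

The mask is derived from the real lengths rather than from the token values, because the padding token and the end-of-text token share the same ID and cannot be told apart after padding.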

[Image: CLIP model inputs]

The model is therefore created as follows.

    void CreateClipModel(List<List<int>> tokens)
    {
        _clipModel = modelAsset != null
            ? ModelLoader.Load(modelAsset)
            : ModelLoader.Load(Application.streamingAssetsPath + "/clip_full.sentis");

        UpdateTokenArrays(tokens);

        var graph = new FunctionalGraph();
        var inputIds = new Tensor<int>(new TensorShape(_numPhrases, MaxTokenLength), _tokenArray);
        var inputMask = new Tensor<int>(new TensorShape(_numPhrases, MaxTokenLength), _attentionMaskArray);

        var idConst = FF.Constant(inputIds);
        var maskConst = FF.Constant(inputMask);
        var imageInput = graph.AddInput(DataType.Float, new TensorShape(1, 3, _imageWidth, _imageHeight));

        var outputs = FF.Forward(_clipModel, idConst, maskConst, imageInput);
        var softmax = FF.Softmax(outputs[2][0]);

        Model finalModel = graph.Compile(softmax);
        _clipWorker = new Worker(finalModel, Backend);
    }

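The idea is to build the graph with the functional API and to prepare the arrays in advance, using them as inputIds and inputMask respectively.

In Sentis, Forward returns an array of outputs, and each output can be indexed again, so the access takes the [][] form, as in outputs[2][0].

This was very confusing the first time I saw it.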
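The softmax at the end of the graph turns the similarity scores into probabilities over the candidate phrases. A small Python sketch of that step follows. The logit values are hypothetical and only illustrate the calculation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical image-text similarity scores for two candidate phrases.
probs = softmax([24.5, 19.3])
print(probs)
```

Each probability depends on all the phrases in the batch, so adding or removing a candidate phrase changes every value.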

With that, the output is produced. The image embedding part is too simple to cover, so I will skip it.

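Verification

Now the most important part: verification against Python. Does it really produce the same values as the official Hugging Face BPE and CLIP vit-base-patch32?

First, running it in Python gives the following result.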

[Screenshot: tokenization result in Python]

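Running it in Unity gave the same result.

Then does the output also match when compared against the full CLIP model? Let's compare using the official example, "This is a happy person" and "This is a happy dog".

With the same example, I got the identical result of 0.6945773.

Honestly, the full encode function in Transformers is so complex that I built this by comparing output values at every step, so I am glad it works.

BPE is still used in many models, with slight modifications, so swapping in the trained vocab and merges files should make this implementation widely reusable. There is nothing visual to show for it, though.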

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)
