From f8993f4208674ae981714dc2e286098bc5e1a520 Mon Sep 17 00:00:00 2001
From: Notheisz57 <17867971+Notheisz57@users.noreply.github.com>
Date: Mon, 25 May 2026 22:14:33 -0700
Subject: [PATCH] Improve tokenizer DLL guidance

Instructions for tokenizer DLL requirements
---
 .../create-external-model-transact-sql.md     | 21 ++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/docs/t-sql/statements/create-external-model-transact-sql.md b/docs/t-sql/statements/create-external-model-transact-sql.md
index 743a5c038d1..8a52448d78f 100644
--- a/docs/t-sql/statements/create-external-model-transact-sql.md
+++ b/docs/t-sql/statements/create-external-model-transact-sql.md
@@ -388,8 +388,27 @@ Next, download a version of [ONNX Runtime](https://github.com/microsoft/onnxrunt
 
 Download and build [the `tokenizers-cpp` library](https://github.com/mlc-ai/tokenizers-cpp/tree/main) from GitHub. Once the dll is created, place the tokenizer in the `C:\onnx_runtime` directory.
 
+The tokenizer must be compiled as a dynamic-link library (DLL) using MSVC and must export the following entry point:
+
+```cpp
+#include "tokenizers_cpp.h" // for example: `tokenizers-cpp\include\tokenizers_cpp.h`
+#include <string>
+#include <vector>
+
+extern "C" __declspec(dllexport)
+void LoadBlobJsonAndEncode(
+    const std::string& json_blob,  // contents of `tokenizer.json`
+    const std::string& text,       // input text to tokenize
+    std::vector<int32_t>& out_ids  // output token IDs
+) {
+    // ~~ Implement according to current API of `tokenizers-cpp` ~~
+    // auto tok = tokenizers::Tokenizer::FromBlobJSON(json_blob);
+    // out_ids = tok->Encode(text);
+}
+```
+
 > [!NOTE]
-> Ensure the created dll is named **tokenizers_cpp.dll**
+> The exact signature of this export may change. Ensure the created DLL is named **tokenizers_cpp.dll**.
 
 ### Step 5: Download the ONNX model
 