
Improve tokenizer guidance for local ONNX models #10338

Open: Notheisz57 wants to merge 1 commit into MicrosoftDocs:live from Notheisz57:patch-1


@Notheisz57

Clarify the tokenizers-cpp export that SQL Server calls when a query uses AI_GENERATE_EMBEDDINGS.

Intermediate build instructions are reasonably out of scope, but this specific export was necessary for SQL Server 2025 to tokenize input successfully.

Instructions for tokenizer DLL requirements
@prmerger-automator
Contributor

@Notheisz57 : Thanks for your contribution! The author(s) and reviewer(s) have been notified to review your proposed change.

@learn-build-service-prod
Contributor

Learn Build status updates of commit f8993f4:

✅ Validation status: passed

| File | Status | Preview URL | Details |
| --- | --- | --- | --- |
| docs/t-sql/statements/create-external-model-transact-sql.md | ✅ Succeeded | | |

For more details, please refer to the build report.

@v-regandowner
Contributor

@JetterMcTedder

Can you review the proposed changes?

IMPORTANT: When the changes are ready for publication, adding a #sign-off comment is the best way to signal that the PR is ready for the review team to merge.

#label:"aq-pr-triaged"
@MicrosoftDocs/public-repo-pr-review-team

@prmerger-automator (bot) added the aq-pr-triaged label (tracking label for the PR review team) on May 26, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the CREATE EXTERNAL MODEL documentation to clarify how to build and expose a tokenizer DLL (tokenizers-cpp) used by SQL Server when running local ONNX embedding generation (for example, via AI_GENERATE_EMBEDDINGS).

Changes:

  • Adds a C++ example showing an expected exported entry point for a tokenizer DLL.
  • Adds a note about the DLL export signature potentially changing, and reiterates the expected DLL filename.

Comment on lines +400 to +402
const std::string& json_blob, // contents of `tokenizer.json`
const std::string& text, // input text to tokenize
std::vector<int>& out_ids // output token IDs (the embeddings)
Comment on lines +391 to +402
The tokenizer must be compiled as a shared dynamic link library using MSVC, and must export a specific entry point:

```cpp
#include "tokenizers_cpp.h" // for example: `tokenizers-cpp\include\tokenizers_cpp.h`
#include <string>
#include <vector>

extern "C" __declspec(dllexport)
void LoadBlobJsonAndEncode(
    const std::string& json_blob, // contents of `tokenizer.json`
    const std::string& text,      // input text to tokenize
    std::vector<int>& out_ids     // output token IDs (the embeddings)
// ... (quoted excerpt ends here)
```
Comment on lines 410 to +411
> [!NOTE]
> Ensure the created dll is named **tokenizers_cpp.dll**
> The exact signature of this export may change. Ensure the created dll is named **tokenizers_cpp.dll**
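
The entry point quoted in the review excerpts above is cut off after its third parameter, so its full signature and body are not visible here. As a rough illustration only, the sketch below assumes the function takes exactly those three parameters and fills `out_ids` by reference. `StubEncode` and the `TOKENIZER_EXPORT` macro are hypothetical names introduced for this example; a real DLL would call into tokenizers-cpp instead of the stub, and the values the stub produces are not real token IDs.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical export macro: __declspec(dllexport) matches the MSVC build the
// PR describes; the fallback only lets this sketch compile with other toolchains.
#if defined(_WIN32)
#define TOKENIZER_EXPORT __declspec(dllexport)
#else
#define TOKENIZER_EXPORT __attribute__((visibility("default")))
#endif

// Hypothetical stand-in for the real tokenizer: emits one integer per
// whitespace-separated word (the word's length). Real token IDs are indices
// into the model's vocabulary and would come from tokenizers-cpp.
static std::vector<int> StubEncode(const std::string& text) {
    std::vector<int> ids;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        ids.push_back(static_cast<int>(word.size()));
    }
    return ids;
}

// Assumed shape: same name and first three parameters as the quoted excerpt.
// Whether the real export takes further parameters is not shown in the PR page.
extern "C" TOKENIZER_EXPORT
void LoadBlobJsonAndEncode(
    const std::string& json_blob, // contents of tokenizer.json (ignored by the stub)
    const std::string& text,      // input text to tokenize
    std::vector<int>& out_ids)    // receives one integer per token
{
    (void)json_blob;
    out_ids = StubEncode(text);
}
```

A caller would pass an empty vector and read the result back from it, for example `std::vector<int> ids; LoadBlobJsonAndEncode(json, "hello SQL Server", ids);`. Passing `std::string` and `std::vector` across an `extern "C"` boundary only works when the DLL and its caller share the same compiler and C++ runtime, which may be one reason the note warns that the signature can change.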