Understanding the kernel in Semantic Kernel | Microsoft Learn #679
## Related issues

### #396: astra-assistants-api: A backend implementation of the OpenAI beta Assistants API

Details (similarity score: 0.87)

- [ ] [datastax/astra-assistants-api: A backend implementation of the OpenAI beta Assistants API](https://github.com/datastax/astra-assistants-api)

**Astra Assistant API Service**

A drop-in compatible service for the OpenAI beta Assistants API with support for persistent threads, files, assistants, messages, retrieval, function calling and more, using AstraDB (DataStax's DB-as-a-service offering powered by Apache Cassandra and jvector). Compatible with existing OpenAI apps via the OpenAI SDKs by changing a single line of code.

**Getting Started**

Replace:

```python
client = OpenAI(
    api_key=OPENAI_API_KEY,
)
```

with:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
    }
)
```

Or, if you have an existing Astra DB, you can pass your db_id in a second header:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "astra-db-id": ASTRA_DB_ID
    }
)
```

Create an assistant:

```python
assistant = client.beta.assistants.create(
    instructions="You are a personal math tutor. When asked a math question, write and run code to answer the question.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}]
)
```

By default, the service uses AstraDB as the database/vector store and OpenAI for embeddings and chat completion.

**Third party LLM Support**

We now support many third party models for both embeddings and completion thanks to litellm. Pass the api key of your service using custom request headers. For AWS Bedrock, you can pass additional custom headers:

```python
client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1",
    api_key="NONE",
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "embedding-model": "amazon.titan-embed-text-v1",
        "LLM-PARAM-aws-access-key-id": BEDROCK_AWS_ACCESS_KEY_ID,
        "LLM-PARAM-aws-secret-access-key": BEDROCK_AWS_SECRET_ACCESS_KEY,
        "LLM-PARAM-aws-region-name": BEDROCK_AWS_REGION,
    }
)
```

and again, specify the custom model for the assistant:

```python
assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a personal math tutor. Answer questions briefly, in a sentence or less.",
    model="meta.llama2-13b-chat-v1",
)
```

Additional examples, including third party LLMs (bedrock, cohere, perplexity, etc.), can be found in the examples directory. To run the examples using poetry:

```bash
poetry install
poetry run python examples/completion/basic.py
poetry run python examples/retreival/basic.py
poetry run python examples/function-calling/basic.py
```

**Coverage**

See our coverage report here.

**Roadmap**
Suggested labels{ "key": "llm-function-calling", "value": "Integration of function calling with Large Language Models (LLMs)" }#499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.### DetailsSimilarity score: 0.86 - [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq)CTransformers
Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see ChatDocs Supported Models
InstallationTo install via
UsageIt provides a unified interface for all models: from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")
print(llm("AI is going to")) Run in Google Colab To stream the output: for text in llm("AI is going to", stream=True):
print(text, end="", flush=True) You can load models from Hugging Face Hub directly: llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml") If a model repo has multiple model files ( llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin") 🤗 TransformersNote: This is an experimental feature and may change in the future. To use with 🤗 Transformers, create the model and tokenizer using: from ctransformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model) Run in Google Colab You can use 🤗 Transformers text generation pipeline: from transformers import pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256)) You can use 🤗 Transformers generation parameters: pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1) You can use 🤗 Transformers tokenizers: from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True) # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2") # Load tokenizer from original model repo. LangChainIt is integrated into LangChain. See LangChain docs. GPUTo run some of the model layers on GPU, set the llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50) Run in Google Colab CUDAInstall CUDA libraries using: pip install ctransformers[cuda] ROCmTo enable ROCm support, install the CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers MetalTo enable Metal support, install the CT_METAL=1 pip install ctransformers --no-binary ctransformers GPTQNote: This is an experimental feature and only LLaMA models are supported using [ExLlama](https Install additional dependencies using: pip install ctransformers[gptq] Load a GPTQ model using: llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ") Run in Google Colab If the model name or path doesn't contain the word It can also be used with LangChain. Low-level APIs are not fully supported. DocumentationFind the documentation on Read the Docs. Config
Find the URL for the model card for GPTQ here.

Made with ❤️ by marella

**Suggested labels**

null

### #367: Working Pen.el LSP server. An early demo - An AI overlay for everything -- async and parallelised : r/emacs

Details (similarity score: 0.85)

- [ ] [Working Pen.el LSP server. An early demo - An AI overlay for everything -- async and parallelised : r/emacs](https://www.reddit.com/r/emacs/comments/rr7u8o/working_penel_lsp_server_an_early_demo_an_ai/)

Working Pen.el LSP server. An early demo - An AI overlay for everything -- async and parallelised.

So the idea is that no matter what you're doing, the LSP server generates documentation and refactoring tools for whatever you're looking at -- whether you're surfing the web through Emacs, looking at photos, talking to people, or writing code. It should also be possible to plug this into other editors: one LSP server for everything. So far you can only try it out from within the Docker container. Start pen, open a text file (Python, Elisp, etc.) and run lsp. Anyway, I'm going to continue working on this into next year.

What data is sent to those services, and why are they needed?

```bash
echo "sk-" > ~/.pen/openai_api_key  # https://openai.com/
```

**Suggested labels**

{ "key": "llm-overlay", "value": "Using Large Language Models to generate documentation and refactoring tools for various activities within Emacs and other editors" }

### #369: "You are a helpful AI assistant" : r/LocalLLaMA

Details (similarity score: 0.85)

- [ ] ["You are a helpful AI assistant" : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/18j59g1/you_are_a_helpful_ai_assistant/?share_id=g_M0-7C_zvS88BCd6M_sI&utm_content=1&utm_medium=android_app&utm_name=androidcss&utm_source=share&utm_term=1)

"You are a helpful AI assistant" (Discussion)

Don't say "don't": this confuses them, which makes sense when you understand how they "think". They do their best to string concepts together, but they simply generate the next word in the sequence from the context available. Saying "don't" will put everything following that word into the equation for the following words. This can cause the model to use the very words and concepts you're telling it not to.

(System prompts) Here is some context for the conversation: (Paste in relevant info such as web pages, documentation, etc., as well as bits of the conversation you want to keep in context. When you hit the context limit, you can restart the chat and continue with the same context.)

"You are a helpful AI assistant": this is the demo system prompt to just get agreeable answers from any model. The issue with this is, once again, how they "think". The models can't conceptualize what is helpful beyond agreeing with and encouraging you. This kind of statement can lead to them making up data and concepts in order to agree with you. This is extra fun because you may not realize the problem until you discover for yourself the fallacy of your own logic.

Then pass the list to the assistant you intend to chat with, with something like: "You can confidently answer in these subjects that you are an expert in: (the list)." The point of this is to limit its responses to what it actually knows, while making it answer confidently with the information it is sure about. This has been incredibly useful in my cases, but absolutely check their work.
Suggested labels{ "key": "sparse-computation", "value": "Optimizing large language models using sparse computation techniques" }#418: openchat/openchat-3.5-1210 · Hugging Face### DetailsSimilarity score: 0.85 - [ ] [openchat/openchat-3.5-1210 · Hugging Face](https://huggingface.co/openchat/openchat-3.5-1210#conversation-templates)Using the OpenChat ModelWe highly recommend installing the OpenChat package and using the OpenChat OpenAI-compatible API server for an optimal experience. The server is optimized for high-throughput deployment using vLLM and can run on a consumer GPU with 24GB RAM.
Online DeploymentIf you want to deploy the server as an online service, use the following options:
For security purposes, we recommend using an HTTPS gateway in front of the server. Mathematical Reasoning ModeThe OpenChat model also supports mathematical reasoning mode. To use this mode, include
Conversation TemplatesWe provide several pre-built conversation templates to help you get started.
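For illustration only (not part of the linked model card), here is a hedged sketch of calling an OpenChat OpenAI-compatible server with the openai Python client; the base URL, port, and model name are assumptions for a locally deployed server:

```python
from openai import OpenAI

# Assumed local endpoint of an OpenChat OpenAI-compatible API server.
client = OpenAI(base_url="http://localhost:18888/v1", api_key="none")

response = client.chat.completions.create(
    model="openchat_3.5",  # assumed model name exposed by the server
    messages=[{"role": "user", "content": "Explain vLLM in one sentence."}],
)
print(response.choices[0].message.content)
```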
Suggested labels{ "label": "chat-templates", "description": "Pre-defined conversation structures for specific modes of interaction." } |
## Understanding the kernel in Semantic Kernel | Microsoft Learn

DESCRIPTION:

Understanding the kernel in Semantic Kernel
Article | 02/22/2024 | 2 contributors

In this article:
- The kernel is at the center of everything
- Building a kernel
- Invoking plugins from the kernel
- Going further with the kernel
- Next steps
Similar to the kernel of an operating system, the kernel in Semantic Kernel is responsible for managing the resources that are necessary to run "code" in an AI application. This includes managing the AI models, services, and plugins that are necessary for both native code and AI services to run together.
If you want to see the code demonstrated in this article in a complete solution, check out the following samples in the public documentation repository.
| Language | Link to final solution |
| --- | --- |
| C# | Open example in GitHub |
| Python | Open solution in GitHub |
The kernel is at the center of everything
Because the kernel has all of the services and plugins necessary to run both native code and AI services, it is used by nearly every component within the Semantic Kernel SDK. This means that if you run any prompt or code in Semantic Kernel, it will always go through a kernel.
This is extremely powerful, because it means you as a developer have a single place where you can configure, and most importantly monitor, your AI application. Take, for example, invoking a prompt from the kernel. When you do so, the kernel will select the best AI service to run the prompt, build the prompt using the provided prompt template, send the prompt to the AI service, parse the response, and finally return the response back to your application.
Throughout this entire process, you can create events and middleware that are triggered at each of these steps. This means you can perform actions like logging, providing status updates to users, and, most importantly, practicing responsible AI, all from a single place.
Building a kernel
Before building a kernel, you should first understand the two types of components that exist within a kernel: services and plugins. Services consist of both AI services and other services that are necessary to run your application (e.g., logging, telemetry, etc.). Plugins, meanwhile, are any code you want AI to call or leverage within a prompt.
In the following examples, you can see how to add a logger, a chat completion service, and a plugin to the kernel.
Import the necessary packages:
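A minimal Python sketch of the imports, assuming the semantic-kernel Python package is installed; exact module paths have shifted between SDK releases, so treat them as assumptions rather than the article's verbatim sample:

```python
# Assumed imports for the Python SDK of Semantic Kernel; names may differ
# across semantic-kernel versions.
import logging
import os

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import (
    AzureChatCompletion,
    OpenAIChatCompletion,
)
from semantic_kernel.core_plugins import TimePlugin
```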
If you are using Azure OpenAI, you can use the AzureChatCompletion class.
If you are using OpenAI, you can use the OpenAIChatCompletion class.
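A hedged Python sketch of building the kernel with a logger, a chat completion service, and a plugin; the add_service and add_plugin calls and the OpenAIChatCompletion parameters follow recent semantic-kernel releases and are assumptions, not the article's verbatim sample:

```python
import logging
import os

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.core_plugins import TimePlugin

# A logger gives you one place to watch everything the kernel does.
logging.basicConfig(level=logging.INFO)

kernel = Kernel()

# Service: a chat completion connector. Swap in AzureChatCompletion
# (with your deployment name and endpoint) if you are on Azure OpenAI.
kernel.add_service(
    OpenAIChatCompletion(
        service_id="chat",
        ai_model_id="gpt-3.5-turbo",
        api_key=os.environ["OPENAI_API_KEY"],
    )
)

# Plugin: native code the AI can call, here the built-in time plugin.
kernel.add_plugin(TimePlugin(), plugin_name="time")
```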
Run the today function from the time plugin, then run the ShortPoem function from WriterPlugin using the current time as an argument (both steps are sketched below):
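A hedged sketch of both invocations, assuming the kernel built above and that a WriterPlugin containing a ShortPoem prompt function has also been added to the kernel; the invoke and KernelArguments API follows recent semantic-kernel Python releases and is an assumption:

```python
import asyncio

from semantic_kernel.functions import KernelArguments

# "kernel" is the instance built in the previous sketch; WriterPlugin is
# assumed to have been added to it (e.g., loaded from a prompt directory).

async def main() -> None:
    # Run the "today" function from the time plugin.
    today = await kernel.invoke(plugin_name="time", function_name="today")
    print(today)

    # Run the "ShortPoem" function from WriterPlugin, passing the current
    # time as the input argument.
    poem = await kernel.invoke(
        plugin_name="WriterPlugin",
        function_name="ShortPoem",
        arguments=KernelArguments(input=str(today)),
    )
    print(poem)

asyncio.run(main())
```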