Tutorial 2: Build a simple chat application with Cortex Search¶

Overview¶

This tutorial describes how to set up a retrieval-augmented generation (RAG) chatbot in Snowflake using Cortex Search and the COMPLETE (SNOWFLAKE.CORTEX) function.

What you will learn¶

Create a Cortex Search Service from a dataset downloaded from Kaggle.
Create a Streamlit in Snowflake app that lets you query your Cortex Search Service.

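Prerequisites¶

The following prerequisites are required to complete this tutorial:

You have a Snowflake account and a user with a role that grants the privileges needed to create database tables, virtual warehouse objects, Cortex Search Services, and Streamlit apps.

For instructions on meeting these requirements, see Snowflake in 20 minutes.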

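Step 1: Set up¶

Get the sample data¶

This tutorial uses a sample dataset hosted on Kaggle. The Books dataset is a collection of book titles, authors, and descriptions. You can download the dataset from the following link:

The complete dataset is available on Kaggle.

Note

Outside of a tutorial setting, you would bring your own data, possibly already in a Snowflake table.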

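Create the database and warehouse¶

Run the following SQL code to set up the required database and warehouse. The PUBLIC schema is created along with the database, and the stage is created in Step 2.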

CREATE DATABASE IF NOT EXISTS cortex_search_tutorial_db;

CREATE OR REPLACE WAREHOUSE cortex_search_tutorial_wh WITH
    WAREHOUSE_SIZE='X-SMALL'
    AUTO_SUSPEND = 120
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED=TRUE;

USE WAREHOUSE cortex_search_tutorial_wh;


Note the following:

The CREATE DATABASE statement creates a database. The database automatically includes a schema named PUBLIC.
The CREATE WAREHOUSE statement creates a warehouse that is initially suspended.

Step 2: Load the data into Snowflake¶

First, create a stage to hold the books dataset file that you downloaded from Kaggle.

CREATE OR REPLACE STAGE books_data_stage
    DIRECTORY = (ENABLE = TRUE)
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');


Next, upload the dataset. You can upload it using Snowsight or SQL. To upload using Snowsight:

Sign in to Snowsight.
In the left navigation menu, select Data.
Select the database cortex_search_tutorial_db.
Select the schema public.
Select Stages, and then select books_data_stage.
In the upper right, select the + Files button.
Drag and drop the file into the UI, or select Browse to choose the file from the dialog window.
Select Upload to upload the file BooksDatasetClean.csv.
Select the three dots to the right of the file, and then select Load into table.
Name the table BOOKS_DATASET_RAW, and then select Next.
In the left panel of the Load Data dialog, select First line contains header from the Header menu.
Select Load.
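
The file upload itself can also be scripted instead of using the Snowsight dialog. The following is a minimal sketch, assuming you already have a Snowpark `Session` whose current database and schema are cortex_search_tutorial_db.public. The helper name `upload_books_csv` and the default local path are hypothetical. It only stages the file; loading the staged file into BOOKS_DATASET_RAW is still a separate step.

```python
def upload_books_csv(session, local_path="BooksDatasetClean.csv"):
    """Upload the Kaggle CSV to the tutorial stage (hypothetical helper).

    `session` is assumed to be an existing snowflake.snowpark.Session whose
    current database and schema are cortex_search_tutorial_db.public.
    """
    results = session.file.put(
        local_path,
        "@books_data_stage",
        auto_compress=False,  # keep the file as plain .csv on the stage
        overwrite=True,       # replace an earlier upload of the same file
    )
    # Each PutResult reports the staged file name and its upload status.
    return [(r.target, r.status) for r in results]
```

Calling `upload_books_csv(session)` from a local Python environment returns one `(target, status)` pair per uploaded file, which is a quick way to confirm that the file reached the stage.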

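Step 3: Create a chunking UDF¶

Retrieval models work best on small chunks of text, so feeding long documents to Cortex Search degrades retrieval quality. Next, create a Python UDF that splits text into chunks and returns one row per chunk. Return to the SQL editor and run the following: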

CREATE OR REPLACE FUNCTION cortex_search_tutorial_db.public.books_chunk(
    description string, title string, authors string, category string, publisher string
)
    returns table (chunk string, title string, authors string, category string, publisher string)
    language python
    runtime_version = '3.9'
    handler = 'text_chunker'
    packages = ('snowflake-snowpark-python','langchain')
    as
$$
from langchain.text_splitter import RecursiveCharacterTextSplitter
import copy
from typing import Optional

class text_chunker:

    def process(self, description: Optional[str], title: str, authors: str, category: str, publisher: str):
        if description is None:
            description = "" # handle null values

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 2000,
            chunk_overlap  = 300,
            length_function = len
        )
        chunks = text_splitter.split_text(description)
        for chunk in chunks:
            yield (title + "\n" + authors + "\n" + chunk, title, authors, category, publisher) # always chunk with title
$$;

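
To see what the chunk_size and chunk_overlap settings mean, the sketch below splits a string with a plain fixed-width sliding window. This is a simplified stand-in written for illustration, not the algorithm used by RecursiveCharacterTextSplitter, which prefers to break at separators such as paragraph and line breaks. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=2000, chunk_overlap=300):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a simplified
    stand-in for illustration, not langchain's splitting algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


description = "x" * 4500
chunks = sliding_chunks(description)
print([len(c) for c in chunks])  # prints [2000, 2000, 1100]
```

With the tutorial's settings, a 4,500-character description yields three chunks, and each chunk starts with the last 300 characters of the previous one. An empty description yields no chunks, which matches the UDF producing no rows for books without a description.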

Step 4: Build the chunk table¶

Create a table to store the chunks of text extracted from the book descriptions. Each chunk includes the title and authors to provide context.

CREATE TABLE cortex_search_tutorial_db.public.book_description_chunks AS (
    SELECT
        books.*,
        t.CHUNK as CHUNK
    FROM cortex_search_tutorial_db.public.books_dataset_raw books,
        TABLE(cortex_search_tutorial_db.public.books_chunk(books.description, books.title, books.authors, books.category, books.publisher)) t
);


Verify the contents of the table:

SELECT chunk, * FROM book_description_chunks LIMIT 10;


Step 5: Create a Cortex Search Service¶

Create a Cortex Search Service on the table so that you can search the chunks in book_description_chunks.

CREATE CORTEX SEARCH SERVICE cortex_search_tutorial_db.public.books_dataset_service
    ON CHUNK
    WAREHOUSE = cortex_search_tutorial_wh
    TARGET_LAG = '1 hour'
    AS (
        SELECT *
        FROM cortex_search_tutorial_db.public.book_description_chunks
    );

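
Before building the app, you may want to confirm that the service returns results. One way is the SNOWFLAKE.CORTEX.SEARCH_PREVIEW SQL function, which takes the service name and a JSON string of query parameters. The sketch below only builds that statement as text; the helper name `build_search_preview_sql` is hypothetical, and the column name `chunk` assumes the service was created as shown above.

```python
import json


def build_search_preview_sql(service_name, query, columns, limit=3):
    """Return a SQL statement that previews results from a Cortex Search Service.

    Hypothetical helper: it only builds the statement text. Run the result
    in a worksheet or pass it to session.sql(...).
    """
    params = json.dumps({"query": query, "columns": columns, "limit": limit})
    # Double backslashes and single quotes so the JSON survives being
    # embedded in a single-quoted SQL string literal.
    params_literal = params.replace("\\", "\\\\").replace("'", "''")
    return (
        "SELECT PARSE_JSON(SNOWFLAKE.CORTEX.SEARCH_PREVIEW(\n"
        f"    '{service_name}',\n"
        f"    '{params_literal}'\n"
        "))['results'] AS results;"
    )


print(build_search_preview_sql(
    "cortex_search_tutorial_db.public.books_dataset_service",
    "books about Greek mythology",
    ["chunk"],
))
```

Running the printed statement in a SQL worksheet should return a JSON array of matching chunks once the service has finished its initial indexing.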

Step 6: Create a Streamlit app¶

You can query the service with the Python SDK (using the snowflake Python package). This tutorial demonstrates using the Python SDK in a Streamlit in Snowflake application.

First, make sure that your global Snowsight UI role is the same as the role used to create the service in the service creation step.

Sign in to Snowsight.
In the left navigation menu, select Projects » Streamlit.
Select + Streamlit App.
Important: Select the cortex_search_tutorial_db database and the public schema for the app location.
In the left pane of the Streamlit in Snowflake editor, select Packages and add snowflake (version >= 0.8.0) to install the package in your application.

Replace the example application code with the following Streamlit app:

import streamlit as st
from snowflake.core import Root # requires snowflake>=0.8.0
from snowflake.snowpark.context import get_active_session

MODELS = [
    "mistral-large",
    "snowflake-arctic",
    "llama3-70b",
    "llama3-8b",
]

def init_messages():
    """
    Initialize the session state for chat messages. If the session state indicates that the
    conversation should be cleared or if the "messages" key is not in the session state,
    initialize it as an empty list.
    """
    if st.session_state.clear_conversation or "messages" not in st.session_state:
        st.session_state.messages = []

def init_service_metadata():
    """
    Initialize the session state for cortex search service metadata. Query the available
    cortex search services from the Snowflake session and store their names and search
    columns in the session state.
    """
    if "service_metadata" not in st.session_state:
        services = session.sql("SHOW CORTEX SEARCH SERVICES;").collect()
        service_metadata = []
        if services:
            for s in services:
                svc_name = s["name"]
                svc_search_col = session.sql(
                    f"DESC CORTEX SEARCH SERVICE {svc_name};"
                ).collect()[0]["search_column"]
                service_metadata.append(
                    {"name": svc_name, "search_column": svc_search_col}
                )

        st.session_state.service_metadata = service_metadata

def init_config_options():
    """
    Initialize the configuration options in the Streamlit sidebar. Allow the user to select
    a cortex search service, clear the conversation, toggle debug mode, and toggle the use of
    chat history. Also provide advanced options to select a model, the number of context chunks,
    and the number of chat messages to use in the chat history.
    """
    st.sidebar.selectbox(
        "Select cortex search service:",
        [s["name"] for s in st.session_state.service_metadata],
        key="selected_cortex_search_service",
    )

    st.sidebar.button("Clear conversation", key="clear_conversation")
    st.sidebar.toggle("Debug", key="debug", value=False)
    st.sidebar.toggle("Use chat history", key="use_chat_history", value=True)

    with st.sidebar.expander("Advanced options"):
        st.selectbox("Select model:", MODELS, key="model_name")
        st.number_input(
            "Select number of context chunks",
            value=5,
            key="num_retrieved_chunks",
            min_value=1,
            max_value=10,
        )
        st.number_input(
            "Select number of messages to use in chat history",
            value=5,
            key="num_chat_messages",
            min_value=1,
            max_value=10,
        )

    st.sidebar.expander("Session State").write(st.session_state)

def query_cortex_search_service(query):
    """
    Query the selected cortex search service with the given query and retrieve context documents.
    Display the retrieved context documents in the sidebar if debug mode is enabled. Return the
    context documents as a string.

    Args:
        query (str): The query to search the cortex search service with.

    Returns:
        str: The concatenated string of context documents.
    """
    db, schema = session.get_current_database(), session.get_current_schema()

    cortex_search_service = (
        root.databases[db]
        .schemas[schema]
        .cortex_search_services[st.session_state.selected_cortex_search_service]
    )

    context_documents = cortex_search_service.search(
        query, columns=[], limit=st.session_state.num_retrieved_chunks
    )
    results = context_documents.results

    service_metadata = st.session_state.service_metadata
    search_col = [s["search_column"] for s in service_metadata
                    if s["name"] == st.session_state.selected_cortex_search_service][0]

    context_str = ""
    for i, r in enumerate(results):
        context_str += f"Context document {i+1}: {r[search_col]} \n" + "\n"

    if st.session_state.debug:
        st.sidebar.text_area("Context documents", context_str, height=500)

    return context_str

def get_chat_history():
    """
    Retrieve the chat history from the session state limited to the number of messages specified
    by the user in the sidebar options.

    Returns:
        list: The list of chat messages from the session state.
    """
    start_index = max(
        0, len(st.session_state.messages) - st.session_state.num_chat_messages
    )
    return st.session_state.messages[start_index : len(st.session_state.messages) - 1]

def complete(model, prompt):
    """
    Generate a completion for the given prompt using the specified model.

    Args:
        model (str): The name of the model to use for completion.
        prompt (str): The prompt to generate a completion for.

    Returns:
        str: The generated completion.
    """
    return session.sql("SELECT snowflake.cortex.complete(?,?)", (model, prompt)).collect()[0][0]

def make_chat_history_summary(chat_history, question):
    """
    Generate a summary of the chat history combined with the current question to extend the query
    context. Use the language model to generate this summary.

    Args:
        chat_history (list): The chat messages to include in the summary.
        question (str): The current user question to extend with the chat history.

    Returns:
        str: The generated summary of the chat history and question.
    """
    prompt = f"""
        [INST]
        Based on the chat history below and the question, generate a query that extends the question
        with the chat history provided. The query should be in natural language.
        Answer with only the query. Do not add any explanation.

        <chat_history>
        {chat_history}
        </chat_history>
        <question>
        {question}
        </question>
        [/INST]
    """

    summary = complete(st.session_state.model_name, prompt)

    if st.session_state.debug:
        st.sidebar.text_area(
            "Chat history summary", summary.replace("$", "\\$"), height=150
        )

    return summary

def create_prompt(user_question):
    """
    Create a prompt for the language model by combining the user question with context retrieved
    from the cortex search service and chat history (if enabled). Format the prompt according to
    the expected input format of the model.

    Args:
        user_question (str): The user's question to generate a prompt for.

    Returns:
        str: The generated prompt for the language model.
    """
    if st.session_state.use_chat_history:
        chat_history = get_chat_history()
        if chat_history != []:
            question_summary = make_chat_history_summary(chat_history, user_question)
            prompt_context = query_cortex_search_service(question_summary)
        else:
            prompt_context = query_cortex_search_service(user_question)
    else:
        prompt_context = query_cortex_search_service(user_question)
        chat_history = ""

    prompt = f"""
            [INST]
            You are a helpful AI chat assistant with RAG capabilities. When a user asks you a question,
            you will also be given context provided between <context> and </context> tags. Use that context
            with the user's chat history provided between <chat_history> and </chat_history> tags
            to provide a summary that addresses the user's question. Ensure the answer is coherent, concise,
            and directly relevant to the user's question.

            If the user asks a generic question which cannot be answered with the given context or chat_history,
            just say "I don't know the answer to that question."

            Don't say things like "according to the provided context".

            <chat_history>
            {chat_history}
            </chat_history>
            <context>
            {prompt_context}
            </context>
            <question>
            {user_question}
            </question>
            [/INST]
            Answer:
        """
    return prompt

def main():
    st.title(":speech_balloon: Chatbot with Snowflake Cortex")

    init_service_metadata()
    init_config_options()
    init_messages()

    icons = {"assistant": "❄️", "user": "👤"}

    # Display chat messages from history on app rerun
    for message in st.session_state.messages:
        with st.chat_message(message["role"], avatar=icons[message["role"]]):
            st.markdown(message["content"])

    disable_chat = (
        "service_metadata" not in st.session_state
        or len(st.session_state.service_metadata) == 0
    )
    if question := st.chat_input("Ask a question...", disabled=disable_chat):
        # Add user message to chat history
        st.session_state.messages.append({"role": "user", "content": question})
        # Display user message in chat message container
        with st.chat_message("user", avatar=icons["user"]):
            st.markdown(question.replace("$", "\\$"))

        # Display assistant response in chat message container
        with st.chat_message("assistant", avatar=icons["assistant"]):
            message_placeholder = st.empty()
            question = question.replace("'", "")
            with st.spinner("Thinking..."):
                generated_response = complete(
                    st.session_state.model_name, create_prompt(question)
                )
                message_placeholder.markdown(generated_response)

        st.session_state.messages.append(
            {"role": "assistant", "content": generated_response}
        )

if __name__ == "__main__":
    session = get_active_session()
    root = Root(session)
    main()

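
One detail of the app worth seeing in isolation is the slice in get_chat_history(). It keeps the most recent messages but leaves out the final one, which is the question that main() has just appended. The sketch below mirrors that slice on a plain list; the function name `history_window` is hypothetical.

```python
def history_window(messages, num_chat_messages):
    """Mirror the slice used by get_chat_history() on a plain list."""
    start_index = max(0, len(messages) - num_chat_messages)
    # The final element is the question that was just asked, so it is left out.
    return messages[start_index:len(messages) - 1]


messages = [{"role": "user", "content": f"question {i}"} for i in range(1, 8)]
window = history_window(messages, num_chat_messages=5)
print([m["content"] for m in window])
# prints ['question 3', 'question 4', 'question 5', 'question 6']
```

With the sidebar value set to 5, the history passed to the model therefore holds at most four earlier messages. For the first question in a conversation the history is empty, so create_prompt() searches with the raw question instead of a summary.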

Step 7: Try the app¶

Enter a query in the text box to try out your new app. Some sample queries are:

I like Harry Potter. Can you recommend more books that I would enjoy?
Can you recommend me books on Greek Mythology?

Step 8: Clean up¶

Clean up (optional)¶

Run the following DROP <object> commands to return your system to its state before you began the tutorial:

DROP DATABASE IF EXISTS cortex_search_tutorial_db;
DROP WAREHOUSE IF EXISTS cortex_search_tutorial_wh;


Dropping the database automatically removes all child database objects, such as tables.

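Next steps¶

Congratulations! You have successfully built a simple RAG chat application on text data in Snowflake. You can move on to Tutorial 3 to see how to build an AI chatbot with Cortex Search from a set of PDF files.

Additional resources¶

Continue learning with the following resources: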