Parsing documents with AI_PARSE_DOCUMENT

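AI_PARSE_DOCUMENT is a Cortex AI function that extracts text, data, layout elements, and images from documents. You can combine it with other functions to build custom document processing pipelines for a variety of use cases (see Cortex AI Functions: Documents).

For details about extracting images with AI_PARSE_DOCUMENT, see Cortex AI Functions: Image extraction with AI_PARSE_DOCUMENT.

The function extracts text and layout from documents stored on internal or external stages, and preserves reading order and structures such as tables and headers. For information about creating a stage suitable for storing documents, see Create stage for media files.

AI_PARSE_DOCUMENT orchestrates advanced AI models for document understanding and layout analysis, and processes complex multi-page documents with high accuracy.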

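The AI_PARSE_DOCUMENT function offers two modes for processing PDF documents.

LAYOUT mode is the preferred choice for most use cases, especially for complex documents. It is optimized for extracting text and layout elements such as tables, which makes it the best option for building knowledge bases, optimizing retrieval systems, and improving AI-based applications.
OCR mode is recommended for fast, high-quality text extraction from documents such as manuals, contracts, product detail pages, insurance policies and claims, and SharePoint documents.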

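In both modes, use the page_split option to split multi-page documents into separate pages in the response. You can also use the page_filter option to process only the specified pages. When you use page_filter, page_split is implied, so you don't need to set it explicitly.

AI_PARSE_DOCUMENT scales horizontally, so it can efficiently batch-process many documents at the same time. Documents can be processed directly from object storage, which avoids unnecessary data movement.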

Note

AI_PARSE_DOCUMENT is currently incompatible with custom network policies.

Examples

Simple layout example

This example uses AI_PARSE_DOCUMENT’s LAYOUT mode to process a two-column research paper. The page_split parameter is set to TRUE in order to separate the document into pages in the response. AI_PARSE_DOCUMENT returns the content in Markdown format. The following shows rendered Markdown for one of the processed pages (page index 4 in the JSON output) next to the original page. The raw Markdown is shown in the JSON response following the images.


[Images omitted: a page from the original document, next to the extracted Markdown rendered as HTML]

The following SQL command processes the original document.

SELECT AI_PARSE_DOCUMENT (
    TO_FILE('@docs.doc_stage','research-paper-example.pdf'),
    {'mode': 'LAYOUT' , 'page_split': true}) AS research_paper_example;

The response from AI_PARSE_DOCUMENT is a JSON object that contains metadata and the text of the document's pages, like the following. Some page objects have been omitted for brevity.

{
  "metadata": {
    "pageCount": 19
  },
  "pages": [
    {
      "content": "# SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation \n\nAurick Qiao
      Zhewei Yao Samyam Rajbhandari Yuxiong He<br>Snowflake AI Research<br>San Mateo, CA, United States<br>Correspondence:
      aurick.qiao@ snowflake.com\n\n\n#### Abstract\n\nLLM inference for enterprise applications, such as summarization, RAG,
      and code-generation, typically observe much longer prompt than generations, leading to high prefill cost and response
      latency. We present SwiftKV, a novel model transformation and distillation procedure targeted at reducing the prefill
      compute (in FLOPs) of prompt tokens while preserving high generation quality. First, SwiftKV prefills later layers' KV
      cache using an earlier layer's output, allowing prompt tokens to skip those later layers. Second, SwiftKV employs a
      lightweight knowledge-preserving distillation procedure that can adapt existing LLMs with minimal accuracy impact. Third,
      SwiftKV can naturally incorporate KV cache compression to improve inference performance in low-memory scenarios. Our
      comprehensive experiments show that SwiftKV can effectively reduce prefill computation by $25-50 \\%$ across several LLM
      families while incurring minimum quality degradation. In the end-to-end inference serving, SwiftKV realizes up to $2
      \\times$ higher aggregate throughput and $60 \\%$ lower time per output token. It can achieve a staggering 560 TFlops/GPU
      of normalized inference throughput, which translates to 16 K tokens/s for Llama-3.1-70B. SwiftKV is open-sourced at
      https://github . com/snowflakedb/arctictraining.\n\n\n## 1 Introduction\n\nLarge Language Models (LLMs) are now an
      integral enabler of enterprise applications and offerings, including code and data co-pilots (Chen et al., 2021; Pourreza
      and Rafiei, 2024), retrieval augmented generation (RAG) (Lewis et al., 2020; Lin et al., 2024), summarization (Pu et al.,
      2023; Zhang et al., 2024), and agentic workflows (Wang et al., 2024; Schick et al., 2023). However, the cost and speed of
      inference determine their practicality, and improving the throughput and latency of LLM inference has become increasingly
      important.\n\nWhile prior works, such as model pruning (Ma et al., 2023; Sreenivas et al., 2024), KV cache compression
      (Hooper et al., 2024; Shazeer, 2019; Ainslie et al., 2023b; Chang et al., 2024), and sparse attention (Zhao et al., 2024;
      Jiang et al., 2024), have been developed to accelerate LLM inference, they typically significantly degrade the model
      quality or work best in niche scenarios, such as lowmemory environments or extremely long contexts requests (e.g. $>100
      \\mathrm{~K}$ tokens). On the other hand, production deployments are often compute-bound rather than memory-bound, and
      such long-context requests are rare amongst diverse enterprise use cases (e.g. those observed at Snowflake).\n\nIn this
      paper, we take a different approach to improving LLM inference based on the key observation that typical enterprise
      workloads process more input tokens than output tokens. For example, tasks like code completion, text-to-SQL,
      summarization, and RAG each submit long prompts but produce fewer output tokens (a 10:1 ratio with average prompt length
      between 500 and 1000 is observed in our production). In these scenarios, inference throughput and latency are often
      dominated by the cost of prompt processing (i.e. prefill), and reducing this cost is key to improving their performance.
      \n\nBased on this observation, we designed SwiftKV, which improves throughput and latency by reducing the prefill
      computation for prompt tokens. SwiftKV (Fig. 1) consists of three key components:\n\nModel transformation. SwiftKV rewires
      an existing LLM so that the prefill stage during inference can skip a number of later transformer layers, and their KV
      cache are computed by the last unskipped layer. This is motivated by the observation that the hidden states of later
      layers do not change significantly (see Sec. 3.2 and (Liu et al., 2024b)). With SwiftKV, prefill compute is reduced by
      approximately the number of layers skipped.\n\nOptionally, for low-memory scenarios, we",
      "index": 0
    },
    ...
    {
      "content": "Efficient Distillation. Since only a few $\\mathbf{W}_{Q K V}$ parameters need training, we can keep just a
      single copy of the original model weights in memory that are frozen during training, and add an extra trainable copy of
      the $\\mathbf{W}_{Q K V}$ parameters for layers $>l$ initialized using the original model (See Fig. 1).\n\nDuring
      training, we create two modes for the later layers $>l$, one with original frozen parameters using original architecture,
      and another with the SwiftKV re-wiring using new QKV projections i.e.,\n\n$$\n\\begin{aligned}\n& \\mathbf{y}_{\\text
      {teacher }}=\\mathbf{M}(\\mathbf{x}, \\text { SwiftKV }=\\text { False }) \\\\\n& \\mathbf{y}_{\\text {student }}=\\mathbf
      {M}(\\mathbf{x}, \\text { SwiftKV }=\\text { True })\n\\end{aligned}\n$$\n\nwhere $\\mathbf{y}$ is the final logits,
      $\\mathbf{M}$ is the model, and $\\mathbf{x}$ is the input. Afterwards, we apply the standard distillation loss (Hinton et
      al., 2015) on the outputs. After the distillation, the original KV projection layers $>l$ are discarded during inference.
      \n\nThis method allows us to distill Llama-3.1-8BInstruct on 680 M tokens of data in 3 hours using 8 H100 GPUs, and
      Llama-3.1-70B-Instruct in 5 hours using 32 H100 GPUs across 4 nodes. In contrast, many prune-and-distill (Sreenivas et
      al., 2024) and layer-skipping (Elhoushi et al., 2024) methods require much larger datasets (e.g. 10-100B tokens) and incur
      greater accuracy gaps than SwiftKV.\n\n### 3.5 Optimized Implementation for Inference\n\nLLM serving systems can be
      complex and incorporate many simultaneous optimizations at multiple layers of the stack, such as PagedAttention (Kwon et
      al., 2023), Speculative Decoding (Leviathan et al., 2023), SplitFuse (Holmes et al., 2024; Agrawal et al., 2024), and
      more. A benefit of SwiftKV is that it makes minimal changes to the model architecture, so it can be integrated into
      existing serving systems without implementing new kernels (e.g. for custom attention operations or sparse computation) or
      novel inference procedures.\n\nImplementation in vLLM and SGLang. To show that the theoretical compute reductions of
      SwiftKV translates to real-world savings, we integrated it with vLLM (Kwon et al., 2023) and SGLang (Zheng et al., 2024).
      Our implementation is compatible with chunked prefill (Holmes et al., 2024; Agrawal et al., 2024), which mixes chunks of
      prefill tokens and decode tokens in each minibatch. During each forward pass, after completing layer $l$, the KV-cache for
      the remaining layers ( $>l$ ) are immediately computed, and only the decode tokens are propagated through the rest of the
      model layers.\n\n## 4 Main Results\n\nWe evaluated SwiftKV in terms of model accuracy (Sec. 4.1) compared to the original
      model and several baselines, and end-to-end inference performance (Sec. 4.2) in a real serving system.\n\nDistillation
      datasets. Our dataset is a mixture of Ultrachat (Ding et al., 2023), SlimOrca (Lian et al., 2023), and OpenHermes-2.5
      (Teknium, 2023), totaling roughly 680M Llama-3.1 tokens. For more details, please see Appendix A.1.\n\nSwiftKV Notation.
      For prefill computation, we report the approximate reduction as $(L-l) / L$ due to SwiftKV, and for KV cache, we report
      the exact memory reduction due to AcrossKV. For example, SwiftKV $(l=L / 2)$ and 4-way AcrossKV is reported as $50 \\%$
      prefill compute reduction and $37.5 \\% \\mathrm{KV}$ cache memory reduction.\n\n### 4.1 Model Quality Impact of
      SwiftKV\n\nTable 2 shows the quality results of all models we evaluated, including Llama-3.1-Instruct, Qwen2.
      5-14B-Instruct, Mistral-Small, and Deepseek-V2. Of these models, we note that the Llama models span two orders of
      magnitude in size (3B to 405B), Llama-3.1-405B-Instruct uses FP8 (W8A16) quantization, and Deepseek-V2-Lite-Chat is a
      mixture-of-experts model that implements a novel latent attention mechanism (DeepSeek-AI et al., 2024).\n\nWe also compare
      with three baselines: (1) FFN-SkipLLM (Jaiswal et al., 2024), a training-free method for skipping FFN layers (no attention
      layers are skipped) based on hidden state similarity, (2) Llama-3.1-Nemotron-51B-Instruct (Sreenivas et al., 2024), which
      is pruned and distilled from Llama-3.1-70B-Instruct using neural architecture search on 40B tokens, and (3) DarwinLM-8.4B
      (Tang et al., 2025), which is pruned and distilled from Qwen2.5-14B-Instruct using 10B tokens.\n\nSwiftKV. For Llama,
      Mistral, and Deepseek, we find the accuracy degradation for $25 \\%$ SwiftKV is less than $0.5 \\%$ from the original
      models (averaged across tasks). Additionally, the accuracy gap is within $1-2 \\%$ even at $40-50 \\%$ SwiftKV. Beyond $50
      \\%$ SwiftKV, model quality drops quickly. For example, Llama-3.1-8B-Instruct incurs a 7\\% accuracy gap at $62.5 \\%$
      SwiftKV. We find that Qwen suffers larger degradations, at $1.1 \\%$ for $25 \\%$ SwiftKV and $7.4 \\%$ for $50 \\%$
      SwiftKV, which may be due to Qwen models having lower simularity between layer at 50-75\\% depth (Fig. 2). Even still,
      SwiftKV",
      "index": 4
    },
    ...
  ]
}
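Once the response has been fetched into a client application, its pages can be looked up by their zero-based index. The following is a minimal Python sketch, not part of Snowflake itself; the sample text is a hypothetical, heavily shortened stand-in for the response above, and pages_by_index is an illustrative helper name.

```python
import json

# Hypothetical, heavily shortened response in the shape shown above.
response_text = """
{
  "metadata": {"pageCount": 19},
  "pages": [
    {"content": "# SwiftKV: Fast Prefill-Optimized Inference", "index": 0},
    {"content": "Efficient Distillation. Since only a few parameters need training", "index": 4}
  ]
}
"""


def pages_by_index(response: dict) -> dict:
    """Map each page's zero-based index to its Markdown content."""
    return {page["index"]: page["content"] for page in response["pages"]}


response = json.loads(response_text)
pages = pages_by_index(response)
page_count = response["metadata"]["pageCount"]

# pageCount describes the whole document, not the number of pages returned.
print(page_count, sorted(pages))
```

Keying on the index field rather than on list position keeps lookups correct when only some pages are present in the response.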

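Table structure extraction example

This example demonstrates extracting structural layout, including a table, from a 10-K filing. The following shows the rendered result for one of the processed pages (page index 28 in the JSON output).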


[Images omitted: a page from the original document, next to the extracted Markdown rendered as HTML]

The following SQL command processes the original document.

SELECT AI_PARSE_DOCUMENT (
    TO_FILE('@docs.doc_stage','10K-example.pdf'),
    {'mode': 'LAYOUT', 'page_split': true}) AS sec_10k_example;

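The response from AI_PARSE_DOCUMENT is a JSON object that contains metadata and the text of the document's pages, like the following. Most page objects have been omitted for brevity; the page shown previously is the one at index 28.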

{
  "metadata": {
    "pageCount": 53
  },
  "pages": [
    {
      "content": ...
      "index": 0
    },
    ...
    {
      "content": "# Key Operational and Business Metrics \n\nIn addition to the measures presented in our interim condensed
      consolidated financial statements, we use the following key operational and business metrics to evaluate our business,
      measure our performance, develop financial forecasts, and make strategic decisions.\n\n|  | Three Months Ended March 31, |  |
      \n| :--: | :--: | :--: |\n|  | 2025 | 2024 |\n| Ending Paid Connected Fitness Subscriptions ${ }^{(1)}$ | 2,880,176 | $3,051,
      451$ |\n| Average Net Monthly Paid Connected Fitness Subscription Churn ${ }^{(1)}$ | $1.2 \\%$ | $1.2 \\%$ |\n| Ending Paid
      App Subscriptions ${ }^{(1)}$ | 572,775 | 675,190 |\n| Average Monthly Paid App Subscription Churn ${ }^{(1)}$ | $8.1 \\%$ |
      $9.0 \\%$ |\n| Subscription Gross Profit (in millions) | \\$ 288.8 | \\$ 298.1 |\n| Subscription Contribution (in millions) $
      { }^{(2)}$ | \\$ 304.9 | \\$ 316.4 |\n| Subscription Gross Margin | $69.0 \\%$ | $68.1 \\%$ |\n| Subscription Contribution
      Margin ${ }^{(2)}$ | $72.9 \\%$ | $72.3 \\%$ |\n| Net loss (in millions) | \\$ $(47.7)$ | \\$ $(167.3)$ |\n| Adjusted EBITDA
      (in millions) ${ }^{(3)}$ | \\$ 89.4 | \\$ 5.8 |\n| Net cash provided by operating activities (in millions) | \\$ 96.7 | \\$
      11.6 |\n| Free Cash Flow (in millions) ${ }^{(4)}$ | \\$ 94.7 | \\$ 8.6 |\n\n[^0]\n## Ending Paid Connected Fitness
      Subscriptions\n\nEnding Paid Connected Fitness Subscriptions includes all Connected Fitness Subscriptions for which we are
      currently receiving payment (a successful credit card billing or prepaid subscription credit or waiver). We do not include
      paused Connected Fitness Subscriptions in our Ending Paid Connected Fitness Subscription count.\n\n## Average Net Monthly
      Paid Connected Fitness Subscription Churn\n\nTo align with the definition of Ending Paid Connected Fitness Subscriptions
      above, our quarterly Average Net Monthly Paid Connected Fitness Subscription Churn is calculated as follows: Paid Connected
      Fitness Subscriber \"churn count\" in the quarter, divided by the average number of beginning Paid Connected Fitness
      Subscribers each month, divided by three months. \"Churn count\" is defined as quarterly Connected Fitness Subscription
      churn events minus Connected Fitness Subscription unpause events minus Connected Fitness Subscription reactivations.\n\nWe
      refer to any cancellation or pausing of a subscription for our All-Access Membership as a churn event. Because we do not
      receive payment for paused Connected Fitness Subscriptions, a paused Connected Fitness Subscription is treated as a churn
      event at the time the pause goes into effect, which is the start of the next billing cycle. An unpause event occurs when a
      pause period elapses without a cancellation and the Connected Fitness Subscription resumes, and is therefore counted as a
      reduction in our churn count in that period. Our churn count is shown net of reactivations and our new quarterly Average Net
      Monthly Paid Connected Fitness Subscription Churn metric averages the monthly Connected Fitness churn percentage across the
      three months of the reported quarter.\n\n## Ending Paid App Subscriptions\n\nEnding Paid App Subscriptions include all App
      One, App+, and Strength+ subscriptions for which we are currently receiving payment.\n\n## Average Monthly Paid App
      Subscription Churn\n\nWhen a Subscriber to App One, App+, or Strength+ cancels their membership (a churn event) and
      resubscribes in a subsequent period, the resubscription is considered a new subscription (rather than a reactivation that is
      counted as a reduction in our churn count). Average Paid App Subscription Churn is calculated as follows: Paid App
      Subscription cancellations in the quarter, divided by the average number of beginning Paid App Subscriptions each month,
      divided by three months.\n\n\n[^0]:    (1) Beginning January 1, 2025, the Company migrated its subscription data model for
      reporting Ending Paid Connected Fitness Subscriptions, Average Net Monthly Paid Connected Fitness Subscription Churn, Ending
      Paid App Subscriptions, and Average Monthly Paid App Subscription Churn to a new data model that provides greater visibility
      to changes to a subscription's payment status when they occur. The new model gives the Company more precise and timely data
      on subscription pause and churn behavior. Prior period information has been revised to conform with current period
      presentation. The impact of this change in the model on Ending Paid Connected Fitness Subscriptions, Average Net Monthly
      Paid Connected Fitness Subscription Churn, Ending Paid App Subscriptions and Average Monthly Paid App Subscription Churn for
      the three months ended March 31, 2025 and 2024 is immaterial.\n    (2) Please see the section titled \"Non-GAAP Financial
      Measures—Subscription Contribution and Subscription Contribution Margin\" for a reconciliation of Subscription Gross Profit
      to Subscription Contribution and an explanation of why we consider Subscription Contribution and Subscription Contribution
      Margin to be helpful measures for investors.\n    (3) Please see the section titled \"Non-GAAP Financial Measures—Adjusted
      EBITDA\" for a reconciliation of Net loss to Adjusted EBITDA and an explanation of why we consider Adjusted EBITDA to be a
      helpful measure for investors.\n    (4) Please see the section titled \"Non-GAAP Financial Measures-Free Cash Flow\" for a
      reconciliation of net cash provided by (used in) operating activities to Free Cash Flow and an explanation of why we
      consider Free Cash Flow to be a helpful measure for investors.",
      "index": 28
    },
    ...
    {
      "content": "# CERTIFICATION OF PRINCIPAL FINANCIAL OFFICER PURSUANT TO 18 U.S.C. SECTION 1350, AS ADOPTED PURSUANT TO
      SECTION 906 OF THE SARBANES-OXLEY ACT OF 2002 \n\nI, Elizabeth F Coddington, Chief Financial Officer of Peloton Interactive,
      Inc. (the \"Company\"), do hereby certify, pursuant to 18 U.S.C. Section 1350, as adopted pursuant to Section 906 of the
      Sarbanes-Oxley Act of 2002, that to the best of my knowledge:\n\n1. the Quarterly Report on Form 10-Q of the Company for the
      fiscal quarter ended March 31, 2025 (the \"Report\") fully complies with the requirements of Section 13(a) or 15(d) of the
      Securities Exchange Act of 1934, as amended; and\n2. the information contained in the Report fairly presents, in all
      material respects, the financial condition, and results of operations of the Company.\n\nDate: May 8, 2025\n\nBy: /s/
      Elizabeth F Coddington\nElizabeth F Coddington\nChief Financial Officer\n(Principal Financial Officer)",
      "index": 52
    }
  ]
}
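Because tables come back as Markdown pipe tables inside the content string, a client can recover rows and cells with a few lines of code. The following Python sketch assumes simple tables with no escaped pipe characters inside cells; markdown_table_rows is an illustrative helper name, and the sample is a trimmed excerpt of the page above with the footnote markers removed.

```python
def markdown_table_rows(markdown: str) -> list[list[str]]:
    """Return the cells of every pipe-table row, skipping alignment rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Alignment rows such as "| :--: | :--: |" contain only ':' and '-'.
        if all(cell and set(cell) <= set(":-") for cell in cells):
            continue
        rows.append(cells)
    return rows


# Trimmed excerpt of the page content (after JSON decoding).
page_content = (
    "# Key Operational and Business Metrics \n\n"
    "|  | Three Months Ended March 31, |  |\n"
    "| :--: | :--: | :--: |\n"
    "|  | 2025 | 2024 |\n"
    "| Ending Paid App Subscriptions | 572,775 | 675,190 |\n"
)

rows = markdown_table_rows(page_content)
for row in rows:
    print(row)
```

Cell values stay as strings here; converting figures such as 572,775 to numbers is left to the caller, because the extracted cells can also contain currency symbols and LaTeX-style markup.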

Slide deck example

This example demonstrates extracting structural layout from a presentation. The following shows the rendered result for one of the processed slides (page index 17 in the JSON output).


[Images omitted: a slide from the original document, next to the extracted Markdown rendered as HTML]

The following SQL command processes the original document.

SELECT AI_PARSE_DOCUMENT (TO_FILE('@docs.doc_stage','presentation.pptx'),
    {'mode': 'LAYOUT' , 'page_split': true}) as presentation_output;

The response from AI_PARSE_DOCUMENT is a JSON object that contains metadata and the text of the presentation's slides, like the following. The results for some slides have been omitted for brevity.

{
  "metadata": {
    "pageCount": 38
  },
  "pages": [
    {
      "content": "![img-0.jpeg](img-0.jpeg)\n\n# **SNOWFLAKE INVESTOR PRESENTATION**\n\nFirst Quarter Fiscal 2026\n\n© 2026 Snowflake Inc. All Rights Reserved",
      "index": 0
    },
    ...
    {
      "content": "# Our Consumption Model \n\n## Revenue Recognition Consumption\n\nSnowflake recognizes the substantial majority of its revenue as customers consume the platform\n\nPro: Enables faster growth\nPro: Aligned with customer value\nPro: Aligned with usage-based costs\nConsider: Revenue is variable based on customers' usage\n\n## Pricing Model Consumption\n\nThe platform is priced based on consumption of compute, storage, and data transfer resources\n\nPro: Customers don't pay for shelfware\n\nConsider: Performance improvements inherently reduce customer cost\n\n## Billings Terms Typically Upfront\n\nSnowflake typically bills customers annually in advance for their capacity contracts\n\nSome customers consume on-demand and/or are billed in-arrears\n\nPro: Bookings represent contractual minimum\n\nPro: Variable consumption creates upside for renewal cycle\n\nConsider: Payment terms are evolving",
      "index": 17
    },
    ...
    {
      "content": "![img-23.jpeg](img-23.jpeg)\n\n# PRODUCT REVENUE\n\n## $996.8M + 26% YoY Growth\n\n## NET REVENUE RETENTION RATE\n\n## $124%\n\n## TOTAL CUSTOMERS\n\n## $1M+ CUSTOMERS\n\n## $0.5 + 27% YoY Growth\n\nCustomers with Trailing 12-Month Product Revenue Greater than $1M\n\n## FORBES GLOBAL 2000 CUSTOMERS\n\n## $754 + 4% YoY Growth\n\n## SNOWFLAKE MARKETPLACE LISTINGS\n\n## AI/ML ADOPTION\n\n## 5,200+ Accounts using Snowflake AI/ML\n\n## SNOWFLAKE AI DATA CLOUD\n\n### Unified Platform and Connected Ecosystem\n\n- **Data Engineering**\n- **Analytics**\n- **AI**\n- **Applications & Collaboration**\n\n### Fully Managed | Cross-Cloud | Interoperable | Secure | Governed\n\n1. For the three months ended April 30, 2025.\n2. As of April 30, 2025. Please see our Q1FY26 earnings press release for definitions of net revenue retention rate, customers with trailing 12-month product revenue greater than $1 million (which definition includes a description of our total customer count), and Forbes Global 2000 customers.\n3. As of April 30, 2025. Each live dataset, package of datasets, or data service published by a data provider as a single product offering on Snowflake Marketplace is counted as a unique listing. A listing may be available in one or more regions where Snowflake Marketplace is available.\n4. Adoption is based on capacity and on-demand accounts using Snowflake AI/ML features on a weekly basis via our internal classification. We take the average of the last 4 weeks of the quarter ended April 30, 2025.",
      "index": 36
    },
    {
      "content": "# THANK YOU\n\n![img-24.jpeg](img-24.jpeg)",
      "index": 37
    }
  ]
}
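Slides that contain pictures carry Markdown image references such as ![img-0.jpeg](img-0.jpeg) in their content. The following Python sketch lists those references per slide on the client side; the regular expression covers only the simple reference form seen in this output, and image_targets is an illustrative helper name.

```python
import re

# Matches Markdown image references of the form ![alt](target).
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)]+)\)")


def image_targets(markdown: str) -> list[str]:
    """Return the target of every image reference, in document order."""
    return [match.group(2) for match in IMAGE_REF.finditer(markdown)]


# Two slide contents copied from the response above (after JSON decoding).
slides = {
    0: "![img-0.jpeg](img-0.jpeg)\n\n# **SNOWFLAKE INVESTOR PRESENTATION**",
    37: "# THANK YOU\n\n![img-24.jpeg](img-24.jpeg)",
}

images_per_slide = {index: image_targets(text) for index, text in slides.items()}
print(images_per_slide)
```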

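Multilingual document example

This example showcases the multilingual capabilities of AI_PARSE_DOCUMENT by extracting structural layout from a German article. AI_PARSE_DOCUMENT preserves the reading order of the main text even when images and pull quotes are present.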


[Images omitted: a page from the original document, next to the extracted Markdown rendered as HTML]

The following SQL command processes the original document. The document consists of a single page, so page splitting isn't needed in this example.

SELECT AI_PARSE_DOCUMENT (TO_FILE('@docs.doc_stage','german_example.pdf'),
    {'mode': 'LAYOUT'}) AS german_article;

The response from AI_PARSE_DOCUMENT is a JSON object that contains the document's metadata and text, like the following.

{
  "metadata": {
    "pageCount": 1
  },
  "content": "![img-0.jpeg](img-0.jpeg)\n\nSchulen haben es verdient, gute Orte zu sein. Hier sollen wir Wissen und Fähigkeiten
  erlernen, die uns durch das Leben tragen. Hier verbringen viele einen Großteil ihres Tages, und das in einer Lebensphase, in
  der sich Zeit beinahe grenzenlos und eine Doppelstunde wie ein halbes Leben anfühlen kann.\n\nOb es die Freundin ist, ohne die
  man auf dem Schulhof verloren wäre. Der Lehrer, mit dem man nicht klarkommt, den man aber trotzdem jeden Tag aushalten muss.
  Die Klassenfahrt, auf der man zum ersten Mal das Meer sieht und knutscht. In Schulen entstehen Erfahrungen, Beziehungen und
  Erinnerungen, die uns ein ganzes Leben prägen.\n\nDie Erwartungen an Schulen sind dementsprechend hoch. Trotzdem werden sie
  von der Gesellschaft schnell vergessen und von der Politik hinten angestellt. Seit Jahrzehnten kriegt das deutsche Schulsystem
  verheerende Zeugnisse.\n\nNoch immer entscheiden Bildungsgrad und Kontostand der Eltern darüber, welchen Schulabschluss Kinder
  und Jugendliche machen. Noch immer funktioniert es vielerorts nur auf dem Papier, dass alle gut zusammen lernen. Im Alltag
  fehlen dann die Lehrkräfte und Mittel, um zum Beispiel einen geflüchteten Jugendlichen oder einen mit ADHS so zu unterstützen,
  dass alle möglichst gleichberechtigt in einem Klassenraum sitzen. Auch die gesellschaftliche Einsicht, dass alle
  Schulabschlüsse ihren Wert haben und gebraucht werden, muss erst wieder zurückgewonnen werden.\n\nJetzt aber hoch mit
  euch!\nDass Schule so irre früh anfangen muss, ist kein Gesetz. Und auch gar nicht ratsam: Jugendliche haben einen anderen
  Biorhythmus und brauchen mehr Schlaf als Erwachsene. Ein Schulbeginn gegen 9 oder 10 Uhr wäre für die meisten besser, da ist
  sich die Forschung einig\n\nAn Schulen tritt die Realität sehr schnell ein. Während sich die Gesellschaft noch fragt, wie mit
  künstlicher Intelligenz umzugehen ist, nutzen sie Lehrkräfte, Schülerinnen und Schüler längst für ihre Zwecke. Während über
  Jahre diskutiert wurde, ob Deutschland ein Einwanderungsland sei, war es das an Schulen längst. Und während andere Themen den
  Klimawandel in der Öffentlichkeit verdrängen, sind es besonders Schülerinnen und Schüler, die laut auf das drängendste Problem
  unserer Zeit hinweisen. Die Herausforderungen und Fragen, die sich an Schulen stellen, betreffen uns alle. Schule ist Zukunft.
  \n\nSchulleitungen, Lehrkräfte, pädagogisches Personal und alle, die sich sonst noch um das Gelingen des Schulalltags kümmern,
  stellen sich dem jeden Tag aufs Neue. Sie versuchen, Schule trotz vieler Probleme und fehlender Wertschätzung zu gestalten,
  sie versuchen, den Schülerinnen und Schülern zu vermitteln, dass es auf sie ankommt. Damit sie selbst an sich glauben. Sie
  haben es verdient."
}
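Unlike the earlier examples, this response has a single top-level content field instead of a pages array, because page_split was not set. Client code that handles both shapes can normalize them first. The following is a minimal Python sketch based only on the two response shapes shown on this page; page_texts is an illustrative helper name and the sample values are shortened placeholders.

```python
def page_texts(response: dict) -> list[str]:
    """Return the document text as a list of page strings for either shape."""
    if "pages" in response:
        # Split response: one object per page, ordered by its index field.
        ordered = sorted(response["pages"], key=lambda page: page["index"])
        return [page["content"] for page in ordered]
    # Unsplit response: the whole document is one content string.
    return [response["content"]]


unsplit = {"metadata": {"pageCount": 1}, "content": "Schulen haben es verdient"}
split = {
    "metadata": {"pageCount": 2},
    "pages": [
        {"content": "second page", "index": 1},
        {"content": "first page", "index": 0},
    ],
}

print(page_texts(unsplit))
print(page_texts(split))
```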

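Snowflake Cortex can translate the text into any supported language (English in this case, language code 'en'), as follows:

-- Assumes the extracted German text is stored in column ger_example of a table named german_article.
SELECT SNOWFLAKE.CORTEX.TRANSLATE(ger_example, '', 'en') FROM german_article;

The translation is as follows.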

"Schools deserve to be good places. Here, we are supposed to learn knowledge and skills that will carry us through life. Many
spend a large part of their day here, and this is during a phase of life when time can seem almost endless and a double period
can feel like half a lifetime.

Whether it's the friend you would be lost without in the schoolyard. The teacher you can't get along with, but still have to
endure every day. The class trip where you see the sea for the first time and make out. In schools, experiences,
relationships, and memories are created that shape us for a lifetime.

The expectations for schools are correspondingly high. Nevertheless, they are quickly forgotten by society and pushed to the
back by politics. For decades, the German school system has been receiving devastating reports.

Even now, the level of education and the financial status of the parents still determine which school certificate children and
young people receive. It still only works on paper that everyone learns well together. In everyday life, the teachers and
resources are lacking to support, for example, a refugee youth or a student with ADHD so that they can sit in a classroom on
an equal footing. The societal insight that all school certificates have value and are needed also needs to be regained.

Now, let's get going!

The fact that school has to start so early is not a law. And it's not advisable either: teenagers have a different biological
rhythm and need more sleep than adults. A start time of 9 or 10 o'clock would be better for most, research agrees.

Reality sets in very quickly at schools. While society is still wondering how to deal with artificial intelligence, teachers,
students, and pupils are already being used for their purposes. While it was debated for years whether Germany is an
immigration country, it has been one in schools for a long time. And while other topics are pushing climate change out of the
public eye, it is especially students who are loudly pointing out the most pressing problem of our time. The challenges and
questions that schools face affect us all. School is the future.

School administrations, teachers, educational staff, and all those who take care of the success of everyday school life face
this every day. They try to shape school despite many problems and lack of appreciation, they try to convey to the students
that it's up to them. So that they believe in themselves. They deserve it."

Using OCR mode

OCR mode extracts text from scanned documents, such as screenshots or PDFs that contain images of text. It does not preserve layout.

SELECT AI_PARSE_DOCUMENT(
  TO_FILE( '@docs.doc_stage', 'document_1.pdf' ),
  { 'mode': 'OCR' } ) AS OCR;

Output:

{
  "content": "content of the document"
}

Processing only specific pages of a document

This example uses the page_filter option to extract specific pages from a document, in this case the first page of a 55-page research paper. Keep in mind that page indexes start at 0 and that ranges include the start value but exclude the end value. For example, start: 0, end: 1 returns only the first page (index 0).

SELECT AI_PARSE_DOCUMENT(
  TO_FILE('@my_documents', 'ResearchArticle.pdf'),
  {'mode': 'LAYOUT', 'page_filter': [{'start': 0, 'end': 1}]} );

Result:

{
  "metadata": {
    "pageCount": 55
  },
  "pages": [
    {
      "content": "# The Critical Role of Strength Training in Lifelong Health: Evidence-Based
      Benefits and Implementation Strategies \n\n\n#### Abstract\n\nBackground: Strength training
      has emerged as one of the most powerful interventions for promoting health across the
      lifespan. This comprehensive review examines the extensive evidence supporting strength
      training's role in preventing chronic disease, maintaining functional independence, and
      enhancing quality of life.\n\nMethods: We conducted a systematic review of peer-reviewed
      literature published between 2018-2024, analyzing 127 studies involving over 45,000
      participants across various populations.\n\nResults: Regular resistance exercise provides
      cardiovascular benefits ( $15-20 \\%$ reduction in heart disease risk), metabolic improvements
      ( $12-18 \\%$ better insulin sensitivity), cognitive enhancements ( $25 \\%$ slower
      cognitive decline), and psychological well-being improvements. Strength training increases
      bone mineral density by $1-3 \\%$ annually and reduces fall risk by up to $40 \\%$ in older
      adults.\n\nConclusions: Current guidelines recommend at least two sessions per week targeting
      all major muscle groups. Implementation of strength training programs should be considered a
      public health priority given the substantial evidence for disease prevention and health
      promotion.\n\n\nKeywords: resistance training, muscle strength, bone density, chronic
      disease prevention, healthy aging, exercise prescription\n\n## Introduction\n\nThe human
      musculoskeletal system is designed for regular mechanical loading and progressive challenge.
      Throughout evolutionary history, our ancestors engaged in strength-demanding activities
      essential for survival, maintaining robust muscle mass and bone density well into advanced age.
      However, the modern sedentary lifestyle has created an unprecedented mismatch between our
      biological needs and daily activities, contributing to rising rates of sarcopenia,
      osteoporosis, and metabolic dysfunction.\n\nStrength training, also known as resistance
      training or weight training, represents a targeted intervention that can address many
      contemporary health challenges. Unlike aerobic exercise alone, resistance training provides
      unique physiological adaptations that are essential for long-term health and functional
      independence. The World Health Organization now recognizes strength training as a fundamental
      component of physical activity guidelines for all adults.\n\nKey Statistics: Only 31\\% of
      adults meet strength training recommendations, despite evidence showing $20-30 \\%$ reductions
      in all-cause mortality among regular participants.\n\n## Physiological Mechanisms and
      Adaptations\n\n## Musculoskeletal Benefits\n\nStrength training stimulates muscle protein
      synthesis through mechanistic target of rapamycin (mTOR) pathway activation, leading to
      increased muscle fiber size and improved neuromuscular coordination. Research demonstrates
      that adults can increase muscle mass by $2-4 \\%$ per month during initial training phases,
      with continued improvements possible throughout life.\n\nBone tissue responds to mechanical
      loading through osteoblast activation and increased bone formation. Weight-bearing resistance
      exercises create piezoelectric effects that stimulate osteocyte networks, resulting in
      improved bone mineral density and reduced fracture risk. Studies show 1-3\\% annual",
      "index": 0
    }
  ]
}

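Classify multiple documents¶

To classify multiple documents, first create a table of files by retrieving the document locations from the stage directory and converting those locations to FILE objects.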

CREATE TABLE documents_table AS
  (SELECT TO_FILE('@my_documents', RELATIVE_PATH)
    AS docs FROM DIRECTORY(@my_documents));

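Next, apply AI_PARSE_DOCUMENT to each document in the table and process the results. For example, pass the results to AI_CLASSIFY to classify the documents by type. This approach is an efficient way to analyze a collection of documents in batches.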

WITH single_page_extraction as (
  SELECT
  TO_VARCHAR (AI_PARSE_DOCUMENT(docs, {'mode': 'LAYOUT',
    'page_filter': [{'start': 0, 'end': 1}]} )) AS first_page FROM documents_table)
SELECT AI_CLASSIFY(
  first_page,
  ['health', 'fitness','economics', 'science', 'psychology' ,'sociology','statistics', 'finance', 'Artificial Intelligence', 'Analytics'],
  {'output_mode': 'multi'} ) as article_classification
FROM single_page_extraction;

This query returns the classification labels for each document.

{ "labels": [ "health", "psychology", "science" ] }
{ "labels": [ "fitness", "health", "science" ] }
{ "labels": [ "Analytics", "Artificial Intelligence" ] }
{ "labels": [ "finance", "Analytics" ] }

...

{ "labels": [ "finance" ] }
{ "labels": [ "Artificial Intelligence", "science" ] }
{ "labels": [ "Artificial Intelligence", "science" ] }
{ "labels": [ "fitness", "health", "science" ] }
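
The output above does not show which file each set of labels belongs to. The following is a minimal sketch of one way to keep the staged path next to each result. The table name `documents_with_paths` and the shortened category list are illustrative assumptions; the stage name and function options are the same as in the example above.

```sql
-- Hypothetical table name: stores the staged path alongside each FILE object.
CREATE TABLE documents_with_paths AS
  SELECT RELATIVE_PATH AS relative_path,
         TO_FILE('@my_documents', RELATIVE_PATH) AS docs
  FROM DIRECTORY(@my_documents);

-- Parse only the first page of each document, then classify it.
WITH single_page_extraction AS (
  SELECT relative_path,
         TO_VARCHAR(AI_PARSE_DOCUMENT(docs, {'mode': 'LAYOUT',
           'page_filter': [{'start': 0, 'end': 1}]})) AS first_page
  FROM documents_with_paths)
SELECT relative_path,
       AI_CLASSIFY(
         first_page,
         ['health', 'fitness', 'finance'],   -- shortened list for illustration
         {'output_mode': 'multi'}) AS article_classification
FROM single_page_extraction;
```

Each row of the result then pairs a file path with its labels, which makes it easier to join the classification back to other metadata.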

Input requirements¶

AI_PARSE_DOCUMENT is optimized for both digital-born and scanned documents. The following table lists the limitations and requirements for input documents.


Maximum file size	100 MB
Maximum pages per document	500
Maximum page resolution	10000 x 10000 pixels; 33.3 x 33.3 inches (at 300 DPI); 2400 x 2400 points (at 300 DPI)
Supported file types	PDF, PPTX, DOCX, JPEG, JPG, PNG, TIFF, TIF, HTML, TXT
Stage encryption	Server-side encryption
Font size	8 points or larger for best results
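
Some of these limits can be checked before any document is parsed. The following is a sketch, assuming a stage named `@my_documents` with a directory table enabled; it keeps only files whose size and extension fall within the limits in the table above. The page count and page resolution are not available from the directory table, so those limits are not checked here.

```sql
-- Keep only staged files that satisfy the size and file-type requirements.
SELECT RELATIVE_PATH,
       SIZE,
       TO_FILE('@my_documents', RELATIVE_PATH) AS docs
FROM DIRECTORY(@my_documents)
WHERE SIZE <= 104857600                                   -- 100 MB in bytes
  AND LOWER(SPLIT_PART(RELATIVE_PATH, '.', -1)) IN
      ('pdf', 'pptx', 'docx', 'jpeg', 'jpg', 'png',
       'tiff', 'tif', 'html', 'txt');                     -- supported file types
```

Filtering at this stage avoids spending a function call on files that would be rejected anyway.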

Supported document features and limitations¶


Page orientation	AI_PARSE_DOCUMENT automatically detects page orientation.
Page splitting	AI_PARSE_DOCUMENT can split multi-page documents into individual pages and parse each one separately. This is useful for processing large documents that exceed the maximum size.
Page filtering	AI_PARSE_DOCUMENT can process a subset of the pages in a document, rather than all of them, when you specify page ranges. This is useful when you know which pages contain the information you are looking for.
Characters	AI_PARSE_DOCUMENT detects the following characters: a-z A-Z 0-9 À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Ą ą Ć ć Č č Đ đ Ę ę ı Ł ł Ń ń ō Œ œ Ś ś Š š Ÿ Ź ź Ż ż Ž ž ʒ β δ ε з Ṡ ! “ # $ % & ‘ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~ ¡ ¢ £ ¥ § © ª « ® ¯ ° ± ² ³ ´ µ ¶ · º » ¿ ‘ † ‡ • ‣ ⁋ ₣ ₤ ₦ ₩ € ₭ ₹ ™ ← ↑ → ↓ ↔ ↕ ↖ ↗ ↘ ↙ ↰ ↱ ↲ ↳ ↴ ↵
Images	AI_PARSE_DOCUMENT generates markup for images in the document, but does not currently extract the actual images.
Structured elements	AI_PARSE_DOCUMENT automatically detects and extracts tables and forms.
Fonts	AI_PARSE_DOCUMENT recognizes text in most serif and sans-serif fonts, but might have difficulty with decorative or script fonts. The function does not recognize handwriting.

Supported languages¶

AI_PARSE_DOCUMENT is trained for the following languages:


OCR mode	LAYOUT mode
English, French, German, Italian, Norwegian, Polish, Portuguese, Spanish, Swedish	Chinese, English, French, German, Hindi, Italian, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian

Regional availability¶

Support for AI_PARSE_DOCUMENT is available to accounts in the following Snowflake regions:


AWS	Azure	Google Cloud Platform
US West 2 (Oregon)	East US 2 (Virginia)	US Central (Iowa)
US East (Ohio)	West US 2 (Washington)
US East 1 (N. Virginia)	Europe (Netherlands)
Europe (Ireland)
Europe Central 1 (Frankfurt)
Europe West 2 (London)
Asia Pacific (Sydney)
Asia Pacific (Tokyo)

AI_PARSE_DOCUMENT has cross-region support in other Snowflake regions. For information about enabling Cortex AI cross-region support, see Cross-region inference.
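
As a rough sketch, cross-region inference is typically enabled through the account-level `CORTEX_ENABLED_CROSS_REGION` parameter. The value `'ANY_REGION'` below is only one possible setting and is used here as an assumption; check the cross-region inference documentation for the values that fit your data-residency requirements.

```sql
-- Requires the ACCOUNTADMIN role.
USE ROLE ACCOUNTADMIN;

-- Allow Cortex AI requests to be processed in another region
-- when the function is not available in the account's own region.
ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION';

-- Confirm the current setting.
SHOW PARAMETERS LIKE 'CORTEX_ENABLED_CROSS_REGION' IN ACCOUNT;
```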

Access control requirements¶

To use the AI_PARSE_DOCUMENT function, a user with the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. For details, see Cortex LLM privileges.
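
The grant described above might look like the following sketch. The role name `doc_parser_role` and the user name `some_user` are hypothetical placeholders; database roles are granted to an account role, which is then granted to the user.

```sql
USE ROLE ACCOUNTADMIN;

-- Hypothetical account role for users who parse documents.
CREATE ROLE IF NOT EXISTS doc_parser_role;

-- Give the role access to Cortex AI functions.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE doc_parser_role;

-- Hypothetical user who will call AI_PARSE_DOCUMENT.
GRANT ROLE doc_parser_role TO USER some_user;
```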

Cost considerations¶

The Cortex AI_PARSE_DOCUMENT function incurs compute costs based on the number of pages per processed document. The following describes how pages are counted for different file formats:

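For paged file formats (PDF, DOCX), each page in the document is billed as a page.
For image file formats (JPEG, JPG, TIF, TIFF, PNG), each individual image file is billed as a page.
For HTML and TXT files, each chunk of 3,000 characters is billed as a page, including the last chunk, which can be shorter than 3,000 characters.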
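
To make the HTML and TXT rule concrete, the following sketch computes the number of billed pages from a character count by rounding up to the next 3,000-character chunk. The file names and character counts are made-up illustrative values, not measurements.

```sql
-- Billed pages for text files: ceiling of (characters / 3,000).
SELECT file_name,
       char_count,
       CEIL(char_count / 3000) AS billed_pages
FROM (VALUES
        ('notes.txt',  7500),   -- 2 full chunks + 1 partial chunk = 3 pages
        ('readme.txt', 3000),   -- exactly 1 chunk = 1 page
        ('short.txt',   120)    -- 1 partial chunk = 1 page
     ) AS t(file_name, char_count);
```

The same expression can be applied to `LENGTH(...)` of the actual file content if the text is already loaded into a column.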

Snowflake recommends executing queries that call the Cortex AI_PARSE_DOCUMENT function in a smaller warehouse (no larger than MEDIUM). Larger warehouses do not increase performance.

Error conditions¶

Snowflake Cortex AI_PARSE_DOCUMENT can produce the following error messages:


Message	Description
`Document contains language that is not supported.`	The input document contains a language that is not supported.
`The provided file format {file_extension} isn't supported. Supported formats: .['.docx', '.pptx', '.pdf'].`	The document is in an unsupported format.
`The provided file format .bin isn't supported. Supported formats: ['.docx', '.pptx', '.pdf']. Ensure the file is stored with server-side encryption.`	The file format is not supported and is interpreted as a binary file.
`Maximum number of 500 pages exceeded. The document has {actual_pages} pages.`	The document exceeds the 500-page limit.
`Page size in pixels exceeds 10000x10000. The page size is {actual_px} pixels.`	The image input or a converted document page is larger than the supported dimensions.
`Page size in inches exceeds 50x50 (3600x3600 pt). The page size is {actual_in} inches ({actual_pt} pt).`	The page is larger than the supported dimensions.
`Maximum file size of 104857600 bytes exceeded. The file size is {actual_size} bytes.`	The document is larger than 100 MB.
`Provided file cannot be found.`	The file does not exist.
`Provided file cannot be accessed.`	The file cannot be accessed because of insufficient privileges.
`The Parse Document function did not respond in the allowed time.`	A timeout occurred.
`Internal error.`	A system error occurred. Wait and try again.

Legal notices¶

The data classification of inputs and outputs is shown in the following table.


Input data classification	Output data classification	Designation
Usage Data	Customer Data	Generally available functions are Covered AI Features. Preview functions are Preview AI Features. [1]

For additional information, refer to Snowflake AI and ML.