GenAI Plugin Suite:
Unlocking the Power of Pentaho Data Integration with GenAI

Helping IT Meet the Demand for Generative AI Applications

Blog categories: Pentaho Data Integration

GenAI, with its ability to generate human-like content, automate complex tasks, and enhance decision-making processes, is proving to be a game-changer across various industries. However, truly unlocking its potential requires organizations to have a robust approach to data integration that can seamlessly blend AI capabilities with their existing data infrastructure.

Pentaho has been making significant investments to both integrate AI capabilities into our products and enable customers to further their AI efforts through key platform features. Our latest delivery in AI is our GenAI Plugin Suite, which will help existing and new Pentaho Data Integration (PDI) customers create and benefit from AI-driven insights.

GenAI Plugin Suite for Pentaho

Pentaho’s suite of tools already streamlines the extraction, transformation, and loading (ETL) of data across diverse sources. However, business needs have evolved around AI and GenAI demands, and so, too, have our data integration capabilities.

The Pentaho Professional Services team, along with our lighthouse customer, developed a set of GenAI plugins specifically for Pentaho Data Integration, ensuring that businesses can leverage the full potential of AI-driven insights while maintaining the integrity and efficiency of their data workflows. These plugins both enhance our core offering and empower organizations to gain deeper insights and drive better outcomes, putting this powerful combination at the center of modern data strategies.

We know data is the cornerstone of any successful GenAI application, and AI models rely heavily on large volumes of high-quality data to train, learn and generate accurate and meaningful outputs. The more diverse, relevant and ready data is the better the model’s ability to produce insights, predictions and human-like content. PDI already delivers clean and transformed data from varied data sources across on-premises and cloud providers.

To integrate GenAI in a seamless and meaningful way we revisited and rethought some of PDI’s core capabilities with input from our lighthouse customer. The output was a series of custom PDI plugins which form the basis for the GenAI Plugin Suite, grouped into three major categories.

1. Data Ingestion Layer

The first category is the Data Ingestion Layer, where raw data from multiple sources is collected, aggregated and prepared for further processing. These custom plugins extend the capabilities of data ingestion and collection within PDI.

Read Unstructured Document

This plugin detects and extracts text from a wide variety of file formats. It can handle a range of document types, including PDFs, Word, PPT and Text, making it a powerful tool for content extraction and data mining of unstructured data, one of the most important features of GenAI applications.

HTML Parser

This plugin parses HTML content and extracts meaningful data from web pages or documents, bringing more value to external third-party data to round out GenAI experiences. It is ideal for web scraping, cleaning up HTML data, or extracting specific elements from complex HTML structures, enabling the end user to scrape the HTML contents and extract the relevant data for GenAI use.

Web Crawler (Planned)

The web crawler plugin is designed to extract data from websites, providing users with the ability to gather unstructured information from the web in a seamless and efficient manner, opening up access and usage of vast amounts of unstructured data from diverse online sources. This data can then be processed through PDI’s GenAI plugins, feeding it into large language models (LLMs) to generate insights, automate content creation or enhance decision-making processes. Automating web data collection enriches the data available for AI applications, driving more comprehensive and informed AI outputs.

2. Transformation Layer

The Transformation Layer plugins support the cleaning, processing and refinement of raw data, converting it into a format suitable for AI training and making it ready for GenAI applications. There are a broad range of existing PDI steps that are already helping our customers achieve their data needs, and these plugins add some additional transformation steps to support GenAI application requirements.

Base64 Encode

A utility plugin that allows the encoding of data (images, text, etc.) into Base64 format, useful for transmitting data, encoding files or preparing data for APIs that require Base64-encoded inputs. The need for Base64 encoded files is important for attaching files to LLMs like OpenAI.

 

Base64 Decode

This plugin will allow for the decoding of Base64 formatted data to its respective file (images, text, etc.), supporting multiple character encoding. The need for Base64 decode plugin will help users convert the Base64 generated response from an LLM to a file.

 

Document Metadata Extractor 

The document metadata extractor plugin aims to extract metadata from a wide variety of file formats like system and file metadata. It can handle a range of document types, including PDFs, Word, PPT and Text, making it a powerful tool for content extraction and data mining to fuel GenAI applications.

3. AI Layer

The AI layer interacts with large language models from providers such as OpenAI and Azure OpenAI. While the GenAI plugins do not aim at building AI models, they play a vital role in seamlessly integrating with both external and local LLM models, ensuring the right data is fed into them for generating responses. This layer is also responsible for processing documents (tokenization) and converting them into vector embeddings using AI models, which is a critical component of a GenAI application.

AI Chat

This plugin provides an interface to connect with LLMs like OpenAI and Azure OpenAI. It generates AI chat responses based on input data and process documents to understand context. The plugin also supports prompt engineering and retrieval augmented generation (RAG) to enable context-based responses on user personal data. The power of this AI Chat plugin in Pentaho lies in its ability to combine seamlessly with other PDI plugins and push traditional ETL processes towards AI modernization.

AI Segmentation

The AI Segmentation Plugin designed to empower users by providing text segmentation capabilities. Leveraging techniques such as token-based, character-based, and regular expression-based segmentation, the plugin seamlessly integrates with LLMs, GenAI, and prompt engineering workflows. By enabling precise text parsing and segmentation, the plugin enhances prompt customization, improves text pre-processing for AI models, and streamlines complex data pipelines. This makes it a valuable asset for end users aiming to maximize the performance and accuracy of AI-driven solutions in tasks like natural language processing, data enrichment, and analytics.

Vector Databases 

Vector databases are specialized databases designed to store and retrieve data as high-dimensional vectors, often used for similarity searches in AI and machine learning applications. They are essential for handling embeddings generated by models like those in GenAI, enabling efficient querying and comparison of complex data such as text, images or audio. The Vector Database PDI plugins are designed to simplify access to a range of databases like PGVector, AlloyDB, Pinecone, Weaviate, Chroma, etc., while also supporting In-Memory file-based vector stores.

Enabling Use-Cases Across the Business

The GenAI Plugin Suite supports a wide variety of use cases, enabling you to apply broad GenAI capabilities within any number of orchestrations and workflows using PDI’s low-code, no-code features.

These plugins also enable the creation of RAG pipelines that can support domain-aware GenAI use cases such as:

  • Content generation incorporating current data – This can improve the relevance and quality of the output generated by the LLM model when creating content such as white papers, videos, blog posts, and marketing and sales campaigns.
  • Customized search – Incorporating data such as company-specific documents, training materials, manuals, etc., can enhance the search experience with more relevant results.

Imagine a company that receives sales financial reports in a PDF document. The sales report holds the executive summary along with product performance across each region and month in a tabular format.

With the help of the GenAI Plugin Suite, Pentaho data engineers and scientists can easily read the content of the financial reports and extract the tabular product performance. With additional help from core PDI plugins, the extracted data can be fed into existing reports and dashboards. PDI enables a low-code / no-code experience, and this use-case takes about 5 minutes to complete and generate the response.

Explore The GenAI Plugin Suite Today!

The GenAI Plugin Suite for Pentaho Data Integration is a game-changer for organizations looking to harness the power of GenAI. Streamlining data ingestion, transformation and integration with large language models, these plugins open new doors for leveraging AI-driven insights across any industry.

Whether you’re dealing with unstructured web data or complex document processing, the GenAI Plugin Suite enhances PDI’s capabilities, bringing advanced AI applications within reach. This suite not only elevates Pentaho’s offering but also empowers businesses to unlock the true potential of their data, driving better outcomes and innovation.

If you’re interested in learning more about the GenAI Plugin Suite and our services, reach out to our Professional Services team to discuss how to get started.