Usage Guide

An example script can be found at /scripts/example_ingestion.py. Note: this package is currently untested with relative paths as input, so it's advised to store the dataset in the same directory as the script (or pass an absolute path) to prevent undesired behavior.
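
If the dataset must live elsewhere, one workaround (a generic Python sketch, not part of this package's API) is to build an absolute path before handing it to the Ingester:

from pathlib import Path

# Resolve the dataset location relative to this script so that the
# Ingester always receives an absolute path.
dataset_path = str(Path(__file__).resolve().parent / "financial_dataset.zip")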


See the README for more information on the package's design.


Loading a textual dataset into a vectorstore

This is the quickest way to ingest a dataset and store the results in a vectorstore, using the default loading, chunking, and embedding strategies:

from easy_ingest_text.ingest_text import Ingester

ingester = Ingester()
ingester.ingest_dataset(
    input_dir="financial_dataset.zip",
    is_zipped=True,
    save_intermediate_docs=True,
    output_dir="financial_documents_output",
    detailed_progress=True,
    chunk_batch_size=100,
    max_files=500,
)

If you want to specify different configuration options for the default Loader, Chunker, or Embedder, you can instantiate them individually and pass them to the Ingester.

Here's an example where we set the configuration options for the provided JSONLoader. The Chunker and Embedder can be overridden in a similar manner.

from easy_ingest_text.ingest_text import Ingester
from easy_ingest_text.load_text import Loader

autoloader_config = {
    "JSONLoader": {
        "required": {
            "jq_schema": ".",
        },
        "optional": {
            "content_key": None,
            "is_content_key_jq_parsable": False,
            "metadata_func": None,
            "text_content": True,
            "json_lines": False,
        },
    },
}
loader = Loader(autoloader_config)
ingester = Ingester(loader=loader)
...
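
From here, ingest_dataset can be called exactly as in the first example.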

Using Custom Classes

If you want to include custom logic for how to load files, chunk documents, or embed documents, you can subclass the relevant class and pass an instance to the Ingester. Each class contains information on which methods should be overridden.

Here's an example of a custom loader for loading JSON documents of a specific format.

import json
from typing import List

from easy_ingest_text.load_text import Loader
from easy_ingest_text.enhanced_document import EnhancedDocument

class CustomLoader(Loader):
    def file_to_docs(self, file_path: str) -> List[EnhancedDocument]:
        file_extension = file_path.split(".")[-1]
        if file_extension == "json":
            with open(file_path) as fin:
                try:
                    data = json.load(fin)
                    text = data["text"]
                    # TODO(STP): Add the filename to the metadata.
                    metadata = {}
                    for key in {
                        "title",
                        "url",
                        "site_full",
                        "language",
                        "published",
                    }:
                        if key in data:
                            metadata[key] = data[key]
                    if "source" in data:
                        # HACK(STP): Since "source" is a reserved keyword for
                        # document metadata, we rename the original field here.
                        metadata["source_"] = data["source"]
                    metadata["source"] = file_path
                    return [
                        EnhancedDocument(page_content=text, metadata=metadata)
                    ]
                except Exception as e:
                    print(f"Failed to parse {file_path}: {e}. Skipping for now")
                    return []
        else:
            return super().file_to_docs(file_path)
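
The custom loader can then be passed to the Ingester just like the configured default loader above (reusing the arguments from the earlier examples):

from easy_ingest_text.ingest_text import Ingester

ingester = Ingester(loader=CustomLoader())
ingester.ingest_dataset(
    input_dir="financial_dataset",
    output_dir="financial_documents_output",
)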

Saving the vectorstore to disk

If you're using a custom vectorstore, it may offer its own functionality for persisting to disk. If you're using the default FAISS vectorstore, you can save it to disk by setting the save_local flag in its config to True, as shown in the example below.

from easy_ingest_text.ingest_text import Ingester
from easy_ingest_text.embed_text import Embedder
from easy_ingest_text import defaults


vectorstore_config = defaults.DEFAULT_VECTORSTORES_CONFIG
vectorstore_config["FAISS"]["save_local_config"]["save_local"] = True
embedder = Embedder(vectorstore_config=vectorstore_config)
ingester = Ingester(embedder=embedder)
ingester.ingest_dataset(
    input_dir="financial_dataset",
    save_intermediate_docs=True,
    output_dir="financial_documents_output",
    detailed_progress=True,
    chunk_batch_size=100,
    max_files=500,
)
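
Once saved, the index can be loaded back in a later session. The snippet below is a minimal sketch that assumes the default vectorstore is LangChain's FAISS wrapper (suggested by the config keys above, but not confirmed here); the save directory and embedding model are placeholders, so check your vectorstore config and Embedder for the actual values.

from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Placeholder: must match the embedding model used at ingest time.
embeddings = HuggingFaceEmbeddings()
vectorstore = FAISS.load_local(
    "faiss_index",  # placeholder: the directory save_local wrote to
    embeddings,
    allow_dangerous_deserialization=True,  # FAISS pickles its docstore
)
docs = vectorstore.similarity_search("example query", k=4)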