Design Overview
Class Structure
NOTE: Mermaid diagrams aren't rendering on GitHub Pages for some reason. See the README for a rendered version of the diagram.
Here's a high-level overview of the classes and how they interact with each other:
sequenceDiagram
participant I as Ingester
participant L as Loader
participant C as Chunker
participant E as Embedder
I->>+L: initiate loading
L-->>-I: return EnhancedDocuments
I->>+C: send documents for chunking
C-->>-I: return chunked documents
I->>+E: send chunks for embedding
E-->>-I: return embedded documents
E->>E: save embedded documents to vectorstore (optional)
Note right of I: Ingester coordinates all interactions and manages workflow
While the right level of abstraction when ingesting textual data into a vectorstore is somewhat subjective, three distinct steps stand out:
- Standardizing the input dataset.
- Chunking the standardized data.
- Embedding the chunked data.
Additionally, we assume that the usual access pattern is to use the same loading, chunking, and embedding mechanism across multiple datasets within a given application, since this provides consistency for downstream applications. With this assumption, it makes sense to have Loader, Chunker, and Embedder classes. Each instance holds the state for its step, i.e. how it should load, chunk, or embed data. We also have an Ingester class, which is responsible for moving data through instances of the three classes mentioned above.
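To make the division of responsibilities concrete, here's a minimal sketch of the four classes. It is not the package's actual code: the method names (load, chunk, embed, ingest), the chunk_size parameter, and the toy embedding are all assumptions made for illustration, and plain strings stand in for documents.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


class Loader:
    """Standardizes raw input into clean text records (hypothetical interface)."""

    def load(self, raw_texts: Iterable[str]) -> Iterator[str]:
        for text in raw_texts:
            yield text.strip()


@dataclass
class Chunker:
    """Splits each record into fixed-size pieces; chunk_size is the shared state."""

    chunk_size: int = 20

    def chunk(self, docs: Iterable[str]) -> Iterator[str]:
        for doc in docs:
            for start in range(0, len(doc), self.chunk_size):
                yield doc[start:start + self.chunk_size]


class Embedder:
    """Toy embedder (assumption): maps a chunk to [length, number of spaces]."""

    def embed(self, chunks: Iterable[str]) -> Iterator[Tuple[str, List[float]]]:
        for chunk in chunks:
            yield chunk, [float(len(chunk)), float(chunk.count(" "))]


@dataclass
class Ingester:
    """Coordinates the pipeline: load, then chunk, then embed."""

    loader: Loader
    chunker: Chunker
    embedder: Embedder

    def ingest(self, raw_texts: Iterable[str]) -> List[Tuple[str, List[float]]]:
        docs = self.loader.load(raw_texts)
        chunks = self.chunker.chunk(docs)
        return list(self.embedder.embed(chunks))


# The same configured instances are reused across datasets.
ingester = Ingester(Loader(), Chunker(chunk_size=10), Embedder())
first = ingester.ingest(["  hello vector store  "])
second = ingester.ingest(["another dataset"])
```

Because the Ingester only holds references to the three configured instances, every dataset passed through it is loaded, chunked, and embedded the same way.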
Another useful layer of abstraction is the EnhancedDocument[^3], which is essentially a piece of text accompanied by some additional information. Every EnhancedDocument must have the following:
- source: the path of the file. We assume the dataset is static (i.e. the raw data does not change).
- page_content: the actual text of the file.
- metadata: additional information about the document. Often useful for querying within the context of knowledge graphs.
- document_hash, content_hash, metadata_hash: hashes of the overall document, the content, and the metadata, respectively. Useful for uniqueness checks.
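The fields above could be sketched as a dataclass like the one below. The field names come from the list, but everything else is an assumption for illustration: the use of SHA-256, serializing metadata as JSON with sorted keys, and deriving document_hash from the other two hashes. The package's real hashing scheme may differ.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict


def _sha256(text: str) -> str:
    """Hex digest of a UTF-8 encoded string (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class EnhancedDocument:
    source: str
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    # The hashes are derived from the fields above, never passed in by callers.
    content_hash: str = field(init=False)
    metadata_hash: str = field(init=False)
    document_hash: str = field(init=False)

    def __post_init__(self) -> None:
        self.content_hash = _sha256(self.page_content)
        # sort_keys makes the hash independent of dict insertion order.
        self.metadata_hash = _sha256(json.dumps(self.metadata, sort_keys=True))
        # Assumption: the document hash combines the two hashes above.
        self.document_hash = _sha256(self.content_hash + self.metadata_hash)


a = EnhancedDocument("notes.txt", "hello", {"lang": "en", "page": 1})
b = EnhancedDocument("notes.txt", "hello", {"page": 1, "lang": "en"})
c = EnhancedDocument("notes.txt", "hello!", {"lang": "en", "page": 1})

same_document = a.document_hash == b.document_hash
content_changed = a.content_hash != c.content_hash
```

Keeping separate content and metadata hashes makes it possible to tell whether a document differs in its text, its metadata, or both.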
Each EnhancedDocument has a one-to-many relationship with its chunks, and each chunk retains the original document's source and metadata. A chunk therefore has the same type as an "unchunked" document, so the package treats chunks as EnhancedDocuments too.
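As a rough illustration of chunks being documents themselves, here's a sketch with a pared-down EnhancedDocument (hashes omitted for brevity) and a hypothetical chunk_document helper that splits on a fixed character count. Both the helper and the splitting strategy are assumptions, not the package's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator


@dataclass
class EnhancedDocument:
    """Pared-down stand-in: only the fields needed for this example."""

    source: str
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def chunk_document(doc: EnhancedDocument, chunk_size: int) -> Iterator[EnhancedDocument]:
    """Yield fixed-size chunks that keep the parent's source and metadata."""
    for start in range(0, len(doc.page_content), chunk_size):
        yield EnhancedDocument(
            source=doc.source,
            page_content=doc.page_content[start:start + chunk_size],
            metadata=dict(doc.metadata),  # copy so chunks don't share one dict
        )


doc = EnhancedDocument("report.txt", "abcdefghij", {"author": "someone"})
chunks = list(chunk_document(doc, chunk_size=4))
```

Since the input and output types match, any code that accepts an EnhancedDocument works on chunks without special handling.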
Each class performs actions related to its position in the ingestion pipeline. The unit of information transferred between classes is an EnhancedDocument.
Misc Implementation Details
- For performance reasons, I've tried to use iterators and batching where possible, both to leverage vectorization and to be more space efficient.
- For its first iteration, this package does not try to implement upserts/idempotence, nor does it clean indexes (this is the main purpose of LangChain's Indexing API, but I didn't have time to implement it).
- I did not try to optimize the size of the package itself by reducing the dependencies used - I figured this would not be the limiting factor here. There are probably unused dependencies.[^4]
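To illustrate the iterators-and-batching point from the first bullet above, here's a minimal sketch of lazily grouping a stream of chunks into batches before embedding. The batched and embed_all helpers and the toy embed_batch function are hypothetical names; in practice embed_batch would be a single vectorized call to an embedding model.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Lazily group any iterable into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def embed_batch(texts: List[str]) -> List[List[float]]:
    """Toy stand-in (assumption) for one vectorized embedding call."""
    return [[float(len(text))] for text in texts]


def embed_all(chunks: Iterable[str], batch_size: int) -> Iterator[List[float]]:
    """Embed a stream of chunks one batch at a time."""
    for batch in batched(chunks, batch_size):
        yield from embed_batch(batch)


# A generator is consumed lazily, so only one batch is in memory at a time.
chunk_stream = (f"chunk {i}" for i in range(5))
batches = list(batched(chunk_stream, 2))
vectors = list(embed_all(["a", "bb", "ccc"], batch_size=2))
```

The batch size trades memory for throughput: larger batches mean fewer model calls, while smaller ones keep peak memory low.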
Footnotes
[^3]: The package doesn't use the name Document since LangChain has claimed that name. Also, LangChain has a _HashedDocument class that offers similar functionality, but since it's a private class, it isn't used by this package.
[^4]: Last minute issues with package conflicts in Google Colab forced me to reduce the size of the dependencies...