ClassifiedKorea

Designing an accurate search engine for classified CIA documents using Generative AI and Retrieval Augmented Generation.

What is ClassifiedKorea?

The US government has released thousands of secret documents to the public, but no one has time to read through all of them. I wanted to design software to search these document dumps for interesting information tidbits.

ClassifiedKorea searches through a set of over 1300 released CIA documents from the Korean War, then outputs a factual summary of its findings with citations and links to the original documents.

User Experience in AI?

One of the exciting things about developing with AI, in my opinion, is how it blurs the line between UX design and software development. Throughout the process, what I was really designing was an experience for historical researchers. Every decision I made was about solving user problems.

When formatting documents, I had to design for two different "users" - the LLM needed sufficient information to draw connections between documents, while the vector database required optimized input to perform effective searches.

For the output's design, my user was an academic researcher needing high quality, accurate information. I created a system of parallel agents to summarize documents (with respect to the prompt) before feeding them into a final summary generator, not just because of context window limitations, but to ensure outputs remained hyper-focused while preserving critical details that might otherwise be lost in a single-pass summarization. This approach preserved the nuanced information researchers require while delivering concise results.

For the interface, I made the deliberate choice to design it as a search rather than a chat, because chats, while standard in AI, imply persistent conversation history. A critical design decision was showing only the sources actually cited as footnotes, rather than displaying every document the vector search retrieved. This choice solved a core user problem: researchers need to see genuinely relevant documents, not just algorithmically similar ones. This focus on building trust and transparency reflects how AI development decisions are fundamentally UX decisions.

Design Process

Here's the nitty-gritty of the step-by-step design process, the choices I made along the way, and why.

Converting PDFs to Text

My first task was accurately converting 1300+ PDFs of varying readability to text. I tested a couple of Optical Character Recognition (OCR) options and ultimately landed on Google's Gemini-powered OCR. Google gave me $300 of free compute for making an account, so the choice was easy - and the output quality was excellent, too.
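
To make the step concrete, here's a minimal sketch of what a Gemini-based OCR pass can look like, assuming the google-generativeai and pdf2image packages; the model name, prompt, and file path are illustrative, not my exact setup.

```python
# Sketch: render each PDF page to an image, then ask Gemini to transcribe it.
import google.generativeai as genai
from pdf2image import convert_from_path

genai.configure(api_key="YOUR_GOOGLE_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

def ocr_pdf(pdf_path: str) -> str:
    """Transcribe every page of a scanned PDF via Gemini."""
    pages = convert_from_path(pdf_path, dpi=300)  # PIL images, one per page
    transcribed = []
    for page in pages:
        response = model.generate_content(
            ["Transcribe all legible text from this scanned document page.", page]
        )
        transcribed.append(response.text)
    return "\n\n".join(transcribed)

text = ocr_pdf("cia_docs/airgram_001.pdf")  # hypothetical filename
```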

Cleaning the Text

Because of factors like typewriter keys with insufficient ink, classified stamps, and redactions, the text needed cleaning. I saved important metadata with Python, then had Gemini clean up spelling mistakes and spacing errors.
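
A rough sketch of that split, reusing the `model` client from the OCR sketch above; the prompt wording and the date regex are illustrative stand-ins for what I actually used.

```python
# Sketch: extract metadata with plain Python first, then let Gemini fix OCR noise.
import re

CLEANUP_PROMPT = (
    "Fix spelling and spacing errors introduced by OCR in the text below. "
    "Do not add, remove, or reword any content. Keep redaction markers as-is.\n\n"
)

def clean_document(raw_text: str) -> dict:
    # Metadata is captured deterministically, before any LLM touches the text.
    date_match = re.search(r"\b\d{1,2} [A-Z][a-z]+ 19\d{2}\b", raw_text)
    cleaned = model.generate_content(CLEANUP_PROMPT + raw_text).text
    return {"date": date_match.group(0) if date_match else None, "text": cleaned}
```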

Formatting JSON Documents

When creating a vector database for RAG, you need both the content (which is used to match search results to queries) and metadata (which is given to the AI model to help contextualize the content).

AI models can only handle so much information at once, so document chunking was necessary. I made my chunks as long as possible, because I wanted to give my LLM a lot to work with. CIA documents often start with a list of descriptors that contextualize the entire document. I didn't want chunks in the middle of the document to lose this context, so I extracted it via traditional programming and applied it to each chunk's content.

I also included links and date of publication in the JSON metadata. These data fields were never touched by an LLM and were only handled via traditional programming, ensuring that no hallucination would impact citation validity at this stage.
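
To illustrate the shape of a chunk, here's a sketch assuming a simple fixed-size character split; the field names and chunk size are illustrative, not the exact ones I used.

```python
# Sketch: split a document into chunks, prepending the header descriptors
# so mid-document chunks keep their context.
def build_chunks(doc: dict, chunk_size: int = 4000) -> list[dict]:
    header = doc["descriptor_block"]  # extracted with plain string parsing
    body = doc["text"]
    chunks = []
    for i in range(0, len(body), chunk_size):
        chunks.append({
            # Content is what gets embedded and matched against queries.
            "content": header + "\n\n" + body[i : i + chunk_size],
            # Metadata rides along untouched by any LLM, so citations stay valid.
            "metadata": {
                "source_url": doc["source_url"],
                "published": doc["date"],
                "chunk_index": i // chunk_size,
            },
        })
    return chunks
```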

Creating a Vector Database

A vector database stores document embeddings as numerical vectors, allowing researchers to search for information based on meaning rather than just keywords. For my project, it formed the backbone of the retrieval system, enabling the AI to quickly find the most contextually relevant historical documents when researchers entered queries.

I used Pinecone to build and host my vector database. They are very kind and provide a lot of free storage and querying.
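Indexing and querying boil down to a few calls. Here's a minimal sketch assuming the pinecone and openai Python clients; the index name, embedding model, and example query are illustrative.

```python
# Sketch: embed each chunk, upsert it into Pinecone, then query by meaning.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("classified-korea")

def embed(text: str) -> list[float]:
    res = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return res.data[0].embedding

# Index: the vector is matched against queries; metadata rides along for citations.
for i, chunk in enumerate(chunks):
    index.upsert(vectors=[{
        "id": f"doc-{i}",
        "values": embed(chunk["content"]),
        "metadata": {**chunk["metadata"], "content": chunk["content"]},
    }])

# Query time: retrieve the most semantically similar chunks.
results = index.query(vector=embed("Chinese troop movements near the Yalu"),
                      top_k=10, include_metadata=True)
```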

Designing RAG Architecture

Retrieval Augmented Generation at its core is actually really simple; it's just giving an AI relevant sources to read before answering.

My RAG added one intermediate step: I gave the relevant sources to multiple agents, which summarized each document with respect to the question asked. Those summaries were then fed to a final ChatGPT instance that synthesized the answer.

This was more expensive, but it generated helpful summaries of each document for the footnote section, and separating the summary and synthesis steps made all outputs better.
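
Here's a sketch of that two-stage pipeline, assuming the `openai_client` from the earlier sketch; the prompts and model names are illustrative stand-ins for the real ones.

```python
# Sketch: parallel per-document summarizers feeding one final synthesis call.
from concurrent.futures import ThreadPoolExecutor

def summarize(question: str, doc_text: str) -> str:
    """Stage 1: condense one document with respect to the question asked."""
    res = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Summarize only the parts of this document relevant to: "
                   f"{question}\n\n{doc_text}"}],
    )
    return res.choices[0].message.content

def answer(question: str, docs: list[str]) -> str:
    # Run the summarizer agents in parallel, one per retrieved document.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(lambda d: summarize(question, d), docs))
    # Stage 2: synthesize a final, cited answer from the summaries alone.
    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Using only these summaries, answer with footnote citations: "
                   f"{question}\n\n" + "\n---\n".join(summaries)}],
    )
    return res.choices[0].message.content
```

The per-document summaries also double as the footnote blurbs shown in the interface, which is part of why the extra stage pays for itself.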

Designing Front End Interface

Finally, I designed a front end that took in people's questions and displayed ChatGPT's answers. I designed the front end like a search engine rather than a conversation, because my model didn't have memory between queries. I also formatted citations nicely and provided the intermediate summaries for the reader's convenience. Typography was a consideration – to preserve readability, I kept text line length short, even on big screens.

If you've gotten this far without trying it - check it out!

Next Steps + Limitations

There are two ways I could move forward. The first would be to double down on the Korean War topic, incorporating more document types and languages. The second would be to widen the scope to cover all released CIA documents. If you have thoughts, please reach out!

Of course, there are limitations to what I designed. I didn't build any rate limiting or user authentication, so someone could theoretically use up all of my OpenAI credits if they wanted. The API also isn't great at handling multiple queries at once (sometimes it loses document summaries), so if more people used this, I'd have to fix that as well.
