Semantic query in 1.7 million SEC 8-K forms

Challenge: Find all the forms relevant to a linguistically ambiguous search query (although semantically clear), and interpret the data within.

Feb 26, 2026

The U.S. Securities and Exchange Commission (SEC) publishes a very large and rich dataset for market insights. Many shops build their business on interpreting that data, or on selling it as insights.

It holds millions of forms, financial reports and other disclosures that the SEC requires of every publicly traded company.

From 2005 through 2025, that includes roughly 1.7 million 8-K forms (approximately 1,715,373).

The challenge: given a search query whose wording is ambiguous but whose meaning is clear, find every relevant form and interpret the data within.

(Note: to avoid divulging anyone's alpha, I'll use innocent examples about fruits and other foods instead.)

Naive approach #1: Run every document through an LLM, such as one of OpenAI's GPT models, and ask it your question ("is this about beans?"). The problem is that doing this for 1.7 million documents is very slow and very expensive.
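To get a feel for "slow and expensive", here is a rough back-of-envelope sketch. Every input except the document count is a placeholder I made up for illustration (tokens per form, price, latency, parallelism), so plug in your own numbers before drawing conclusions.

```python
def estimate_llm_pass(num_docs, tokens_per_doc, usd_per_million_tokens,
                      seconds_per_doc, parallel_requests):
    """Rough cost (USD) and wall-clock time (hours) for one LLM call per document."""
    total_tokens = num_docs * tokens_per_doc
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    hours = num_docs * seconds_per_doc / parallel_requests / 3600
    return cost_usd, hours


# Only num_docs comes from the article; the rest are hypothetical placeholders.
cost, hours = estimate_llm_pass(
    num_docs=1_715_373,
    tokens_per_doc=5_000,          # assumed average form length
    usd_per_million_tokens=1.0,    # assumed price, not a real quote
    seconds_per_doc=2.0,           # assumed latency per request
    parallel_requests=10,          # assumed concurrency
)
print(f"about ${cost:,.0f} and about {hours:,.0f} hours")
```

Whatever the real inputs turn out to be, the bill is paid again for every new question, since each question means another full pass over all the forms.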

Naive approach #2: Filter with regex first, then run the survivors through the LLM. Unfortunately, beans go by a variety of names, and even when a document mentions beans, it might be in a phrase like "he lost his beans" that isn't relevant. Worse, maybe every single document contains the string "bean" because every document talks about "beanstalks"!
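A tiny sketch of how the regex filter goes wrong. The pattern and the sample sentences are invented for illustration; the point is only that a keyword match gives both false positives and false negatives.

```python
import re

# Hypothetical keyword filter: any word that starts with "bean".
BEAN_PATTERN = re.compile(r"\bbean\w*", re.IGNORECASE)

# Made-up sentences standing in for 8-K text.
samples = {
    "relevant": "The company sold 40 tons of beans this quarter.",
    "idiom": "After the merger, he lost his beans.",
    "boilerplate": "This report was prepared by Beanstalk Filing Services.",
    "synonym": "Legume and pulse shipments doubled year over year.",
}

matches = {label: bool(BEAN_PATTERN.search(text)) for label, text in samples.items()}

for label, matched in matches.items():
    print(f"{label:12s} matched={matched}")
```

The idiom and the boilerplate both pass the filter even though they are irrelevant, while the sentence about legumes is dropped even though it is exactly what we want.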

The modern approach is familiar to anybody working with LLMs: RAG (retrieval-augmented generation).

Here's what you do:

1. Download the indexes from the SEC, which list every form ever filed. Filter them down to the 8-Ks and download those 8-Ks. (Note that the SEC allows you to do this, within certain rate limits.) This step will take up over one terabyte of hard drive space.
2. Ingest every 8-K and split its content into chunks you think are semantically reasonable.
3. Embed each of those chunks, and insert them into your database.
4. Index your database so that searching through 1.7 million forms (which, depending on how you split them, will produce millions of chunks) doesn't take very long.
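The first two ingestion steps above can be sketched roughly as follows. I'm assuming the pipe-delimited layout of the SEC's quarterly `master.idx` files (`CIK|Company Name|Form Type|Date Filed|Filename`); the sample rows are invented, and the word-window chunker is a crude stand-in for whatever "semantically reasonable" splitting you settle on. Nothing here touches the network.

```python
def filter_8k_rows(index_text):
    """Yield (cik, company, date_filed, path) for each 8-K row of a pipe-delimited index."""
    for line in index_text.splitlines():
        parts = line.split("|")
        if len(parts) != 5:
            continue  # skip preamble, separator, and blank lines
        cik, company, form_type, date_filed, path = (p.strip() for p in parts)
        if form_type == "8-K":  # exact match; amendments ("8-K/A") are excluded here
            yield cik, company, date_filed, path


def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word windows (a naive chunking strategy)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Invented sample in the assumed index layout (not real filings).
SAMPLE_INDEX = """\
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1111111|EXAMPLE FRUIT CO|8-K|2024-01-05|edgar/data/1111111/0001111111-24-000001.txt
2222222|SAMPLE BEAN CORP|10-K|2024-01-08|edgar/data/2222222/0002222222-24-000002.txt
3333333|DEMO LEGUME INC|8-K|2024-01-09|edgar/data/3333333/0003333333-24-000003.txt
"""

rows = list(filter_8k_rows(SAMPLE_INDEX))
print(len(rows), "8-K filings to download")
print(len(chunk_text("word " * 500)), "chunks from a 500-word document")
```

The overlap between windows is there so that a sentence straddling a chunk boundary still appears whole in at least one chunk; the window and overlap sizes are arbitrary defaults.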

Now every time you need to make a semantic query, you have two steps:

1. Use the embeddings to filter semantically, down to the relevant chunks.
2. Pass the content of the form (or just the relevant chunks) to an LLM to extract the information you want.
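Here is a minimal sketch of those two query steps, using only the standard library. Big caveats: `toy_embed` is a hashed bag-of-words vector, not a real embedding model (it only matches on shared words, so it would miss "legumes" for "beans", which is exactly what a real model fixes); the search is a brute-force scan rather than a proper vector index; and `ask_llm` is a hypothetical stub where your actual LLM call would go.

```python
import hashlib
import math
import re

DIM = 256  # arbitrary vector size for the toy embedding


def toy_embed(text):
    """Hashed bag-of-words unit vector. A placeholder, NOT a semantic embedding."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two unit vectors (just the dot product)."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, index, k=2):
    """Step 1: rank stored chunks by similarity to the query and keep the best k."""
    q = toy_embed(query)
    return sorted(index, key=lambda item: cosine(q, item["vec"]), reverse=True)[:k]


def ask_llm(question, context):
    """Step 2 (hypothetical stub): replace with a real LLM API call."""
    return f"[LLM would answer {question!r} using: {context[:40]}...]"


# Made-up chunks standing in for the embedded 8-K database.
chunks = [
    "The company sold forty tons of beans this quarter.",
    "The board appointed a new chief financial officer.",
    "Bean exports to Europe rose after the harvest.",
]
index = [{"text": c, "vec": toy_embed(c)} for c in chunks]

hits = top_k("beans sold this quarter", index, k=1)
answers = [ask_llm("is this about beans?", h["text"]) for h in hits]
print(answers[0])
```

With millions of chunks, a linear scan like `top_k` is what the indexing step is meant to avoid; an approximate nearest-neighbor index in your vector database plays that role.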

Congratulations: you went from querying 1.7 million documents to querying a couple thousand.