Bring structure to the world's unstructured data
Natural language reasoning engines, like LLMs, are the most important technological development of our lifetimes, for a few reasons:
- Reasoning is the foundation of all leverage
- LLMs will likely be able to reason about, and virtualize, a wide variety of important systems (e.g., physics, biology, chemistry, people, computers, the mess of the healthcare industry) through the interface of natural language
- Making all of the world's knowledge semantically searchable, to accelerate that reasoning, will democratize access to information by orders of magnitude
If we're on the same page about the above, the next question is: how do we get there?
That's becoming clearer and clearer: the biggest bottleneck to more powerful, democratized artificial reasoning is the availability of high-quality data. We can't sit around waiting for algorithmic improvements, and even when they arrive, those algorithms will need all the specialized data they can get their hands on. And the majority of mission-critical, useful data is unstructured, especially the kinds AI is hungriest for. In other words, the value of making more unstructured data useful is growing exponentially.
"Progress in science depends on new techniques, new discoveries and new ideas, probably in that order" - Sidney Brenner
Our goal is to activate and mobilize the ML & data community, the most important community in the world's most important race. If we want to bring the future forward, if we want to point AI at society's most important and specialized problems, we need to coordinate and build a new set of standards and tools. If we want to cure diseases, unbreak the economy, fix global crises, and make scientific breakthroughs, we need unstructured data to meet AI where it is.
It's not going to be easy, and it's not going to happen overnight, but needles are going to be moved.
- We can improve the quality of the majority of the world's data by working together to create new standards (~80% of data is unstructured)
- Every team will need a data engineer, or someone who can effectively function as one
- Building better tools will close the gap faster than convincing more people to specialize in data engineering
- Open, community-led collaborative efforts are the way forward
- The data community holds untapped potential
- Dataism - data wants to be together, to be unified and analyzed as one
- Two-thirds of company data goes unused
- The volume of data is EXPLODING
- Data scientists spend too much time organizing, unifying, and cleaning data
- 90% of the world's data will be unstructured by 2025
- We might run out of data soon...
- Data as the biggest bottleneck in AI
- Data quality as the biggest bottleneck in AI
- The Airbyte community, and the tools they've built for structured data
- LlamaIndex & LangChain communities, and the frameworks they've built to make LLM apps more accessible
- Unstructured.io community, and their work with the public sector
- Modular, easily extensible, and minimally opinionated
- Maximize integrations, not by casting a wide and shallow net, but through deep value and useful, generalizable standards
- Maximize the semantic searchability and semantic control of data (a minimal sketch of what this looks like in practice follows below)
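To make "semantic searchability" concrete, here is a minimal sketch: unstructured text is embedded into vectors and queried by similarity rather than by keyword. The `embed` function below is a toy stand-in, not any particular library's API; in practice you would swap in a real embedding model.

```python
import numpy as np

# Toy placeholder embedding: hashes words into a fixed-size vector so the
# example runs without any external model. Swap in a real embedding model
# (a sentence-transformer, an API call, etc.) for actual semantic behavior.
def embed(text: str, dim: int = 256) -> np.ndarray:
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def semantic_search(query: str, documents: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Rank documents by cosine similarity to the query embedding."""
    query_vec = embed(query)
    scored = [(float(np.dot(query_vec, embed(doc))), doc) for doc in documents]
    return sorted(scored, reverse=True)[:top_k]

docs = [
    "Clinical trial results for a new oncology drug",
    "Quarterly earnings report for a logistics company",
    "Patient intake notes from a rural health clinic",
]
print(semantic_search("healthcare records", docs, top_k=2))
```

With a real embedding model, the ranking reflects meaning rather than shared keywords, which is what makes unstructured data searchable at all.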
- LlamaIndex ingestors
- Underrated open-source data tools
- Operational simplicity in streaming data
- Agent benefits from LlamaIndex
- Chunking strategies (see the sketch after this list)
- The future of search: Augmenting existing data & Metaphor Systems
- Building data lakehouses pragmatically
- Enhancing Python with Buildflow
- At the crossroads of open-source & embeddings: txtai, milvus.io, weaviate.io, milvus.io (again), vectorflow
- Downsides of VectorDBs: Issues, Tunnel vision in AI, Understanding vector embeddings
- Progress in similarity search algorithms
- Discovery system design for retrieval and recommendation engines
- Data ingestion techniques for LLMs and LangChain
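As a small illustration of the chunking and ingestion topics above, here is a hedged sketch of the most common baseline: fixed-size chunks with overlap, so context isn't lost at chunk boundaries. The chunk sizes and sample text are assumptions for illustration, not recommendations from any of the projects listed; LlamaIndex, LangChain, and the vector databases above each offer their own versions of this step.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

if __name__ == "__main__":
    # Stand-in for any unstructured source: notes, tickets, extracted PDF text.
    document = "Unstructured data: clinical notes, support tickets, scanned PDFs. " * 40
    # In a real pipeline each chunk would be embedded and written to a vector store.
    for i, chunk in enumerate(chunk_text(document)):
        print(f"chunk {i}: {len(chunk)} characters")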