Bring structure to the world's unstructured data
Natural language reasoning engines, like LLMs, are the most important technological development of our lifetimes, for a few reasons:
- Reasoning is the foundation of all leverage
- LLMs will likely be able to reason about, and virtualize, a wide variety of important systems (e.g., physics, biology, chemistry, people, computers, the mess of the healthcare industry) through the interface of natural language
- Making all of the world's knowledge semantically searchable, to accelerate that reasoning, will democratize access to information by orders of magnitude
If we're on the same page about the above, the next question is: how do we get there?
That's becoming clearer and clearer: the biggest bottleneck to more powerful, democratized artificial reasoning is the availability of high-quality data. We can't sit around waiting for algorithmic improvements, and even when they arrive, those algorithms will need all the specialized data they can get their hands on. And the majority of mission-critical, useful data is unstructured, especially the kinds AI is hungriest for. In other words, the value of making more unstructured data useful is growing exponentially.
"Progress in science depends on new techniques, new discoveries and new ideas, probably in that order" - Sidney Brenner
Our goal is to activate and mobilize the ML & data community, the most important community in the world's most important race. If we want to bring the future forward, if we want to point AI at society's most important and specialized problems, we need to coordinate and build a new set of standards and tools. If we want to cure diseases, unbreak the economy, fix global crises, and make scientific breakthroughs, we need unstructured data to meet AI where it is.
It's not going to be easy, and it's not going to happen overnight, but needles are going to be moved.
- We can improve the quality of the majority of the world's data by working together to create new standards (~80% of data is unstructured)
- Every team will need a data engineer, or someone who can effectively function as one
- Building better tools will close the gap faster than convincing more people to specialize in data engineering
- Open, community-led collaborative efforts are the way forward
- The data community holds untapped potential
- Dataism - data wants to be together, to be unified and analyzed as one
- Two-thirds of company data goes unused
- The volume of data is EXPLODING
- Data scientists spend too much time organizing, unifying, and cleaning data
- 90% of the world's data will be unstructured by 2025
- We might run out of data soon...
- Data as the biggest bottleneck in AI
- Data quality as the biggest bottleneck in AI
- The Airbyte community, and the tools they've built for structured data
- LlamaIndex & LangChain communities, and the frameworks they've built to make LLM apps more accessible
- Unstructured.io community, and their work with the public sector
- Modular, easily extensible, and minimally opinionated
- Maximize integrations, not by casting a wide and shallow net, but through deep value and useful, generalizable standards
- Maximize the semantic searchability and semantic control of data (a minimal sketch of what this looks like in practice follows below)
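To make "semantic searchability" concrete, here is a minimal sketch: unstructured text is embedded into vectors and queried by similarity rather than by keyword. The `embed` function below is a toy stand-in, not any particular library's API; in practice you would swap in a real embedding model.

```python
import numpy as np

# Toy placeholder embedding: hashes words into a fixed-size vector so the
# example runs without any external model. Swap in a real embedding model
# (a sentence-transformer, an API call, etc.) for actual semantic behavior.
def embed(text: str, dim: int = 256) -> np.ndarray:
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def semantic_search(query: str, documents: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Rank documents by cosine similarity to the query embedding."""
    query_vec = embed(query)
    scored = [(float(np.dot(query_vec, embed(doc))), doc) for doc in documents]
    return sorted(scored, reverse=True)[:top_k]

docs = [
    "Clinical trial results for a new oncology drug",
    "Quarterly earnings report for a logistics company",
    "Patient intake notes from a rural health clinic",
]
print(semantic_search("healthcare records", docs, top_k=2))
```

With a real embedding model, the ranking reflects meaning rather than shared keywords, which is what makes unstructured data searchable at all.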
- LlamaIndex ingestors
- Underrated open-source data tools
- Operational simplicity in streaming data
- Agent benefits from LlamaIndex
- Chunking strategies (see the sketch after this list)
- The future of search: Augmenting existing data & Metaphor Systems
- Building data lakehouses pragmatically
- Enhancing Python with Buildflow
- At the crossroads of open-source & embeddings: txtai, milvus.io, weaviate.io, milvus.io (again), vectorflow
- Downsides of VectorDBs: Issues, Tunnel vision in AI, Understanding vector embeddings
- Progress in similarity search algorithms
- Discovery system design for retrieval and recommendation engines
- Data ingestion techniques for LLMs and LangChain
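As a small illustration of the chunking and ingestion topics above, here is a hedged sketch of the most common baseline: fixed-size chunks with overlap, so context isn't lost at chunk boundaries. The chunk sizes and sample text are assumptions for illustration, not recommendations from any of the projects listed; LlamaIndex, LangChain, and the vector databases above each offer their own versions of this step.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

if __name__ == "__main__":
    # Stand-in for any unstructured source: notes, tickets, extracted PDF text.
    document = "Unstructured data: clinical notes, support tickets, scanned PDFs. " * 40
    # In a real pipeline each chunk would be embedded and written to a vector store.
    for i, chunk in enumerate(chunk_text(document)):
        print(f"chunk {i}: {len(chunk)} characters")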