a collection of resources and code pointed at the problem of making unstructured data useful

jaynewcompute/unstructuredcapabilities

Mission

Bring structure to the world's unstructured data

Problem Statement

Natural language reasoning engines, like LLMs, are the most important technological development of our lifetimes, for a few reasons:

  1. Reasoning is the foundation of all leverage
  2. LLMs will likely be able to reason about, and virtualize, a wide variety of important systems (e.g. physics, biology, chemistry, people, computers, the mess of the healthcare industry, etc.) through the interface of natural language
  3. Making all of the world's knowledge semantically searchable, to accelerate that reasoning, will democratize access to information by orders of magnitude

If we're on the same page about the above, the next question is, how do we get there?

That's becoming more and more clear: the biggest bottleneck to more powerful, democratized artificial reasoning is the availability of high-quality data. We can't sit around and wait for algorithmic improvements, and even when they come, the algorithms will need to get their hands on as much specialized data as they can. Moreover, the majority of mission-critical, useful data is unstructured, especially the types that AI is hungry for. In other words, the importance of increasing the amount of useful unstructured data is growing exponentially.

"Progress in science depends on new techniques, new discoveries and new ideas, probably in that order" - Sydney Brenner

Our goal is to activate and mobilize the ML & data community, the most important community in the world's most important race. If we want to bring the future forward, if we want to point AI at society's most important and specialized problems, we need to coordinate and build a new set of standards and tools. If we want to cure diseases, unbreak the economy, fix global crises, and make scientific breakthroughs, we need unstructured data to meet AI closer to where it's at.

It's not going to be easy, and it's not going to happen overnight, but needles are going to be moved.

A Few Things We Believe

  • We can improve the quality of the majority of data in the world by working together to create new standards (~80% of data is unstructured)
  • Every team will need a data engineer, or someone who can effectively function as one
  • Building better tools will close the gap faster than convincing more people to specialize in data engineering
  • Open, community-led collaborative efforts are the way forward
  • The data community holds untapped potential
  • Dataism - data wants to be together, to be unified and analyzed as one

Some External Validation

Inspiration

  • The Airbyte community, and the tools they've built for structured data
  • The LlamaIndex & LangChain communities, and the frameworks they've built to make LLM apps more accessible
  • The Unstructured.io community, and their work with the public sector

Principles

  • Modular, easily extensible, and unopinionated
  • Maximize integrations, not by casting a wide and shallow net, but through deep value and useful, generalizable standards
  • Maximize the semantic searchability and semantic control of data

Resources & Insights
