Jiayuan (Forrest)
@Tisoga
Dec 4, 2023 · 23 tweets

How https://t.co/AgfDhVEASK built an efficient RAG system 🔎 We promised to share the underlying technology behind https://t.co/AgfDhVEASK, and this is the first in a series of threads on what we've built for this project. We have also opened a GitHub repo dedicated to collecting feedback and suggestions; comments are welcome. 🧵 https://t.co/hfsrXhNLdp

The full name of RAG is Retrieval-Augmented Generation. It comes from a 2020 Facebook paper, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (yes, you read that right, this technique already existed in 2020). 1/22 https://t.co/jH1danxQjX
One of the problems this paper addresses is very simple: how to get language models to use external knowledge during generation. Typically, a pre-trained model's knowledge is stored in its parameters, so the model is unaware of knowledge outside its training set (e.g., search data or industry-specific knowledge). The previous practice was to fine-tune the pre-trained model whenever new knowledge became available. 2/22
There are several problems with this approach: 1. we need to fine-tune every time there is new knowledge; 2. the cost of training the model is very high. So this paper proposes the RAG approach: the pre-trained model can make sense of new knowledge supplied at inference time, so we can simply put the knowledge we want the model to use into the prompt. 3/22
So a minimal RAG system consists of 3 parts: 1. the language model; 2. the collection of external knowledge the model needs, stored as vectors; 3. recall of the external knowledge needed in the current scenario. 4/22 https://t.co/saiozgUT3a
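To make those three parts concrete, here is a minimal sketch in Rust. Everything in it is an illustrative placeholder rather than devv.ai's actual implementation: `language_model` stands in for a call to a real LLM, and recall is reduced to naive keyword overlap just to show the data flow (vector-based recall is covered below).

```rust
// Minimal RAG data flow. All names and the scoring here are
// illustrative placeholders, not devv.ai's actual implementation.

// Part 1: the language model (stubbed; a real system calls an LLM API).
fn language_model(prompt: &str) -> String {
    format!("<answer generated from>\n{prompt}")
}

// Part 2: the external knowledge collection.
fn knowledge_base() -> Vec<&'static str> {
    vec![
        "Rust enforces memory safety through ownership and borrowing.",
        "Goroutines are lightweight threads managed by the Go runtime.",
    ]
}

// Part 3: recall the knowledge relevant to the current query.
// Naive keyword overlap stands in for vector similarity here.
fn recall<'a>(query: &str, kb: &[&'a str], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &'a str)> = kb
        .iter()
        .map(|doc| {
            let hits = query
                .split_whitespace()
                .filter(|w| doc.to_lowercase().contains(&w.to_lowercase()))
                .count();
            (hits, *doc)
        })
        .collect();
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored.into_iter().take(k).map(|(_, d)| d).collect()
}

fn main() {
    let query = "How does Rust handle memory safety?";
    let context = recall(query, &knowledge_base(), 1).join("\n");
    // The core RAG move: put the recalled knowledge into the prompt.
    let prompt = format!("Context:\n{context}\n\nQuestion: {query}");
    println!("{}", language_model(&prompt));
}
```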
LangChain and LlamaIndex are essentially RAG systems (including, of course, agents built on top of RAG). If you understand the essence, there is no need to add an extra layer of abstraction; just build the system to fit your business situation. For example, to maintain high performance, we use a Go + Rust architecture that can support highly concurrent RAG requests. 5/22
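The thread doesn't detail how the Go and Rust pieces divide the work; as a hedged illustration of the "highly concurrent requests" point, here is a sketch using only the Rust standard library, with a one-thread-per-request model standing in for a real async runtime or worker pool.

```rust
use std::sync::Arc;
use std::thread;

// Illustrative only: a read-only knowledge index shared across threads
// so that many recall requests can be served concurrently.
fn recall(index: &[String], query: &str) -> Option<String> {
    index.iter().find(|doc| doc.contains(query)).cloned()
}

fn main() {
    let index = Arc::new(vec![
        "rust ownership model".to_string(),
        "go scheduler internals".to_string(),
    ]);

    let queries = ["rust", "go"];
    let handles: Vec<_> = queries
        .iter()
        .map(|q| {
            let index = Arc::clone(&index);
            let q = q.to_string();
            // One thread per in-flight request keeps the sketch simple;
            // a production system would use an async runtime or pool.
            thread::spawn(move || (q.clone(), recall(&index, &q)))
        })
        .collect();

    for h in handles {
        let (q, hit) = h.join().unwrap();
        println!("{q}: {hit:?}");
    }
}
```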
To simplify the problem: no matter what kind of RAG system you build, optimizing it means optimizing each of these 3 modules. 6/22
1) The language model. Why did this 2020 paper not catch fire until this year? A major reason is that earlier base models were not capable enough. If the underlying model is dumb, then even given a wealth of external knowledge, it cannot reason over that knowledge. The benchmarks in the paper do show improvement, but nothing particularly dramatic. 7/22 https://t.co/Q7Rs7o0afE
1.1) The emergence of GPT-3 made RAG usable for the first time. The first wave of companies built on RAG + GPT-3 achieved very high valuations & ARR (Annual Recurring Revenue): - Copy AI - Jasper Both built RAG products for marketing and were once star AI unicorns, though of course their valuations have shrunk dramatically since they were demystified. 8/22
1.2) Since 2023, there has been a large number of open- and closed-source base models on which RAG systems can be built. The most common combinations are: - GPT-3.5/4 + RAG (closed-source option) - Llama 2 / Mistral + RAG (open-source option) 9/22
2) The external knowledge collection the model needs. By now everyone should be familiar with embedding models, including recall over embedded data. The essence of embedding is to transform data into vectors and then find the best-matching vectors by cosine similarity. knowledge -> chunks -> vectors user query -> vector 10/22 https://t.co/ZZZ5toDlH1
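Cosine similarity itself is just a normalized dot product. A small self-contained example (the 3-dimensional vectors are toys; real embeddings have hundreds or thousands of dimensions):

```rust
// Cosine similarity: dot(a, b) / (|a| * |b|).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Toy 3-dimensional "embeddings" for a query and two chunks.
    let query = [0.9, 0.1, 0.0];
    let chunk_a = [1.0, 0.0, 0.0]; // similar direction -> score near 1
    let chunk_b = [0.0, 1.0, 0.0]; // nearly orthogonal -> score near 0
    println!("query vs chunk_a: {:.3}", cosine(&query, &chunk_a));
    println!("query vs chunk_b: {:.3}", cosine(&query, &chunk_b));
}
```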
2.1) This module has two parts: 1. the embedding model; 2. a database for storing the embedding vectors. For the former, most people use OpenAI's embedding model; for the latter there are many options, including Pinecone, Zilliz (from a Chinese team), the open-source Chroma, and pgvector built on relational databases. 11/22
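As a concrete example of the first part, here is a minimal Rust sketch that calls OpenAI's public embeddings endpoint. The model name and response shape match OpenAI's documented REST API; the crates and error handling are illustrative choices, not necessarily what devv.ai uses.

```rust
// Cargo.toml (illustrative versions):
//   reqwest = { version = "0.11", features = ["blocking", "json"] }
//   serde_json = "1"

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api_key = std::env::var("OPENAI_API_KEY")?;
    let client = reqwest::blocking::Client::new();

    // POST /v1/embeddings turns a text chunk into a vector.
    let resp: serde_json::Value = client
        .post("https://api.openai.com/v1/embeddings")
        .bearer_auth(api_key)
        .json(&serde_json::json!({
            "model": "text-embedding-ada-002",
            "input": "How to build a RAG system",
        }))
        .send()?
        .json()?;

    // The vector is at data[0].embedding; it would then be written to
    // the vector store of your choice (Pinecone, Chroma, pgvector, ...).
    let embedding = resp["data"][0]["embedding"]
        .as_array()
        .expect("embedding array in response");
    println!("dimensions: {}", embedding.len()); // 1536 for ada-002
    Ok(())
}
```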
2.2) These companies building embedding databases have also received very high funding and valuations in this wave of AI hype. But thinking from first principles, the purpose of module 2 is to store an external knowledge collection and recall it when needed. This step does not necessarily require an embedding model; traditional search matching may work better in some scenarios (e.g., Elasticsearch). 12/22
2.3) The approach https://t.co/AgfDhVEASK uses is embedding + a traditional relational DB + Elasticsearch. The idea is that the more work we do when encoding knowledge, the faster & more accurate we can be when retrieving it (the difference between doing the work up front and doing it later). 13/22
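The thread doesn't say how results from the embedding index and Elasticsearch are merged. One common technique (an assumption here, not necessarily devv.ai's method) is reciprocal rank fusion, which combines already-ranked lists from different retrievers:

```rust
use std::collections::HashMap;

// Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank).
// A common way to merge ranked lists from different retrievers; the
// thread does not say which fusion method devv.ai actually uses.
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (rank, doc) in list.iter().enumerate() {
            // Ranks are 1-based in the RRF formula.
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut merged: Vec<_> = scores.into_iter().collect();
    merged.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    merged
}

fn main() {
    // Top results from the embedding index and from Elasticsearch.
    let vector_hits = vec!["doc_a", "doc_b", "doc_c"];
    let keyword_hits = vec!["doc_b", "doc_d", "doc_a"];
    // k = 60 is the constant suggested in the original RRF paper.
    for (doc, score) in rrf(&[vector_hits, keyword_hits], 60.0) {
        println!("{doc}: {score:.4}");
    }
}
```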
2.4) We used Rust to build the entire knowledge index, including: - GitHub code data - Development documentation data - Search engine data 14/22
3) Better recall of the external knowledge needed in the current scenario. Following the principle of doing the work up front, we do a lot of processing on raw knowledge data at encoding time: - Program analysis of code - Logical chunking of development documents - Extraction of web page information & page-ranking optimization 15/22
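The thread doesn't describe the exact chunking rules, so the sketch below is a guess at the simplest version of "logical chunking": splitting a Markdown document at heading boundaries so each chunk stays one coherent section, instead of cutting at a fixed character count.

```rust
// Split a Markdown document at heading boundaries so that each chunk
// is one logical section, instead of cutting at a fixed size. A
// simplified guess at "logical chunking"; the thread does not give
// devv.ai's exact rules.
fn chunk_by_headings(doc: &str) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for line in doc.lines() {
        if line.starts_with('#') && !current.trim().is_empty() {
            chunks.push(current.trim().to_string());
            current.clear();
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        chunks.push(current.trim().to_string());
    }
    chunks
}

fn main() {
    let doc = "# Install\nRun `cargo add`.\n\n# Usage\nCall the search API.\n";
    for (i, chunk) in chunk_by_headings(doc).iter().enumerate() {
        println!("--- chunk {i} ---\n{chunk}");
    }
}
```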
3.1) After doing the above work, we can ensure that the data we retrieve is already structured and needs little further processing, and recall accuracy improves. 16/22
Now look at this diagram from a16z: it extends each step with corresponding components, but the core essence remains the same. https://t.co/guXM4g72av 17/22 https://t.co/f2Svc2iEQX
Perplexity, a search engine built on this kind of RAG system back in 2022, already gets tens of millions of visits per month, and LangChain has been valued at several hundred million dollars. 18/22
Whether it's a general-purpose RAG or a domain-specific RAG, this is an area where it's easy to get a passing grade but hard to score 90 out of 100. There is no single best practice for each step: embedding chunk size, whether to connect a search engine, and so on all need to be tried against the actual business scenario. There are many related papers, but not every method mentioned in them is useful. 19/22
Today's thread is just a high-level overview of some of the technology underlying https://t.co/AgfDhVEASK, without going into deep technical detail. The goal is to help developers who want to enter this field think about the technology from first principles and demystify it. 20/22
Next, we will share LLM-related technology in weekly threads. While writing this thread I used https://t.co/AgfDhVEASK heavily to look up related information, and it was very helpful. 21/22 https://t.co/oFuEJkwV1J
If you want to see more content like this, you can retweet this tweet so more people see it, and if anything above is expressed incorrectly, you are welcome to point it out in the comments. We will compile detailed articles at https://t.co/hfsrXhNLdp and welcome your star support (and more importantly, feedback on https://t.co/AgfDhVEASK!). 22/22
Jiayuan (Forrest)
Building @devv_ai, an AI-powered search engine for developers.