Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of language models by incorporating external information. Compared to using a large language model (LLM) on its own, RAG offers several advantages:
More Precise Answers: By retrieving relevant information from a knowledge base, RAG can provide more accurate and contextually appropriate responses.
Reduced Hallucinations: RAG reduces the likelihood of generating false or irrelevant information by grounding its outputs in real-world data.
Access to Latest Information: RAG can utilize up-to-date external sources, ensuring that the generated content is current.
RAG Structure
The typical structure of a RAG system includes:
Vector Database: This stores the knowledge base as vectors for efficient retrieval.
Embedding Model: Converts text from the knowledge base into vector representations.
Large Language Model (LLM): Generates responses based on retrieved context and its own understanding.
RAG Process
The process of RAG involves two main steps:
Indexing: The knowledge base is indexed using an embedding model, converting textual information into vectors for storage in a vector database.
Query and Response Generation: When a query is received, the system retrieves relevant context from the vector database and uses it to generate a response with the help of the LLM.
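As a concrete illustration of both steps, here is a minimal sketch using Chroma (listed under Vector Database below) with its default embedding model; the collection name and documents are made up for the example:

```python
import chromadb  # assumes the chromadb package is installed

# Indexing: store documents in a vector database. Chroma embeds them
# with its default embedding model when no embeddings are supplied.
client = chromadb.Client()
collection = client.create_collection("knowledge_base")  # name is made up
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "RAG retrieves relevant context before generating an answer.",
        "Ollama runs large language models locally.",
    ],
)

# Query: retrieve the most relevant context for the user's question,
# then place it into the prompt that is sent to the LLM.
question = "How does RAG work?"
results = collection.query(query_texts=[question], n_results=1)
context = results["documents"][0][0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```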
Open-Source Solutions
Several open-source frameworks are available for implementing RAG:
LlamaIndex: Connects to various data sources and converts data into vectors using an embedding model.
Haystack
LangChain
GraphRAG: A logic reasoning framework that supports multi-hop fact questions.
KAG: An OpenSPG engine-based framework for building domain-specific knowledge bases.
AnythingLLM: Provides a chat interface to convert documents into context for LLMs.
MaxKB: An open-source knowledge-base question-answering system used in customer service, internal knowledge bases, and education.
RAGFlow: A deep document understanding-based RAG engine that provides reliable Q&A services.
FastGPT: A knowledge-base Q&A system with data-processing and model-calling capabilities.
Langchain-Chatchat: Local knowledge base Q&A based on Langchain and ChatGLM.
FlashRAG: A Python toolset for reproducing and developing RAG research, including 36 preprocessed benchmark datasets and 15 advanced algorithms.
Open WebUI (formerly Ollama WebUI): An extensible, feature-rich, and user-friendly self-hosted web interface designed to run fully offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.
Vector Database
Qdrant: A fast, Rust-based vector database focusing on performance and efficiency.
Chroma: Popular for its simplicity and ease of use.
Embedding Model
Sentence Transformers: A library for generating high-quality sentence embeddings.
BERT (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based model known for its ability to understand context by pre-training on a large corpus of text.
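As a small illustration of the embedding step, here is a sketch using the sentence-transformers library; the model name all-MiniLM-L6-v2 is just a common default, not something prescribed by this post:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model
embeddings = model.encode([
    "RAG grounds answers in retrieved context.",
    "Vector databases store embeddings for fast search.",
])

# Cosine similarity between the two sentence vectors.
cos = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(cos)
```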
Language Models (LLMs)
llama3.3
Gemma2
Cloud Solutions
Cloud-based services that support RAG include:
Google Vertex AI Matching Engine: Provides vector search and a managed RAG solution for enterprise search.
AWS Kendra + Sagemaker/Bedrock: Combines with Kendra for enterprise search and Bedrock for LLMs to build RAG solutions.
Azure AI Search + Azure OpenAI Service: Offers vector search and integrates well with Azure OpenAI Service for building RAG applications.
火山方舟 (Volcengine Ark): A large-model service platform.
腾讯云ES (Tencent Cloud ES): A RAG offering built on the Elasticsearch ecosystem.
阿里云PAI (Alibaba Cloud PAI): Provides a guide for deploying a RAG-based dialogue system.
For local use, first run ollama serve to start the service, then run a model, for example ollama run llama3.3. Ollama will pull llama3.3 first, then run it and present the following prompt:
```
$ ollama run llama3.3
>>> Send a message (/? for help)
```
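Besides the interactive prompt, ollama serve also exposes a local REST API (port 11434 by default), so the model can be called programmatically. A minimal sketch, assuming the service is running and llama3.3 has already been pulled:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    json={"model": "llama3.3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])  # the full generated answer
```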
Different models occupy different amounts of video memory; you can check the currently running models with ollama ps:
```
~$ ollama ps
NAME             ID              SIZE    PROCESSOR          UNTIL
llama3.3:latest  a6eb4748fd29    45 GB   46%/54% CPU/GPU    4 minutes from now
```
Here 54% of the model is running on the GPU and 46% on the CPU.
You can also check which models are available locally with ollama list:
```
~$ ollama list
NAME                        ID              SIZE     MODIFIED
qwen2.5:32b                 9f13ba1299af    19 GB    7 days ago
llava:latest                8dd30f6b0cb1    4.7 GB   10 days ago
codegemma:latest            0c96700aaada    5.0 GB   10 days ago
internlm2:20b               a864ac8dade2    11 GB    10 days ago
internlm2:latest            5050e36678ab    4.5 GB   10 days ago
glm4:latest                 5b699761eca5    5.5 GB   10 days ago
deepseek-coder-v2:latest    63fb193b3a9b    8.9 GB   12 days ago
wizardcoder:latest          de9d848c1323    3.8 GB   12 days ago
```
```
# Secondly, you can find your NVIDIA driver's version with the following command
ls /usr/src | grep nvidia
# 3. Lastly, you should install the corresponding version with the following commands
sudo apt install dkms
sudo dkms install -m nvidia -v xxx
# https://stackoverflow.com/questions/67071101/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-mak
```
--vbv-maxrate: Maximum local bitrate (kbits/sec). Used only if vbv-bufsize is also non-zero. Both vbv-bufsize and vbv-maxrate are required to enable VBV in CRF mode. Default 0 (disabled).
--vbv-bufsize: Specify the size of the VBV buffer (kbits). Enables VBV in ABR mode. In CRF mode, --vbv-maxrate must also be specified. Default 0 (VBV disabled).
--vbv-init: Initial buffer occupancy. The portion of the decode buffer which must be full before the decoder will begin decoding. Determines the absolute maximum frame size. May be specified as a fractional value between 0 and 1, or in kbits; in other words, these two option pairs are equivalent: --vbv-bufsize 1000 --vbv-init 900 and --vbv-bufsize 1000 --vbv-init 0.9.
Note that vbv-bufsize / vbv-maxrate gives the number of seconds of video the buffer holds.
--vbv-end: Final buffer fullness. The portion of the decode buffer that must be full after all the specified frames have been inserted into the decode buffer.
--min-vbv-fullness: Minimum VBV fullness percentage to be maintained. Specified as a fractional value ranging between 0 and 100. Default 50, i.e. tries to keep the buffer at least 50% full at any point in time. Decreasing the minimum required fullness shall improve the compression efficiency, but is expected to affect VBV conformance. Experimental option.
--max-vbv-fullness: Maximum VBV fullness percentage to be maintained. Specified as a fractional value ranging between 0 and 100. Default 80, i.e. tries to keep the buffer at most 80% full at any point in time. Increasing the maximum allowed fullness shall improve the compression efficiency, but is expected to affect VBV conformance. Experimental option.
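Putting the main options together, here is a sketch of enabling VBV in CRF mode, invoked from Python for illustration; the filenames and bitrate numbers are arbitrary:

```python
import subprocess

maxrate_kbit = 5000   # VBV bitrate ceiling (kbit/s)
bufsize_kbit = 10000  # decoder buffer: bufsize/maxrate = 2 seconds of video

# In CRF mode both --vbv-maxrate and --vbv-bufsize are required to enable VBV.
subprocess.run([
    "x265", "--crf", "23",
    "--vbv-maxrate", str(maxrate_kbit),
    "--vbv-bufsize", str(bufsize_kbit),
    "--input", "input.y4m",
    "--output", "out.hevc",
])
```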
DCR (Degradation Category Rating): Each trial evaluates a pair of videos (source, processed); viewers rate the impairment of the processed video relative to the source on the scale (Imperceptible, Perceptible but not annoying, Slightly annoying, Annoying, Very annoying).
PC/CCR (Pair Comparison / Comparison Category Rating): Each trial evaluates a pair of processed videos, comparing the two treatments; viewers rate which one is better on the scale (Much worse, Worse, Slightly worse, The same, Slightly better, Better, Much better).
After rating, MOS (Mean Opinion Score) or DMOS (Differential Mean Opinion Score) is used to aggregate the scores. MOS is typically used with ACR: the higher the score, the better the quality. DMOS is typically used with DCR: the lower the score, the closer the processed video is to the original.
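As a toy illustration of the aggregation step (all scores below are made up): MOS is simply the mean of the opinion scores collected for one condition, and the DCR impairment scores are averaged the same way:

```python
# ACR scores on a 1-5 scale (higher = better quality).
acr_scores = [4, 5, 3, 4, 4]
mos = sum(acr_scores) / len(acr_scores)   # MOS = 4.0

# DCR impairment scores (lower = closer to the source video).
dcr_scores = [2, 1, 2, 3, 2]
dmos = sum(dcr_scores) / len(dcr_scores)  # DMOS = 2.0
```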
-sharpness: filter sharpness; 0 = sharpest, 7 = least sharp
-strong: use strong filter
-sharp_yuv: use a more accurate (sharper) RGB->YUV conversion
-partition_limit
-pass: analysis pass number
-qrange: specify the minimum and maximum QP range
-mt: use multi-threading
-alpha_method: specify the method used for alpha-channel compression
-alpha_filter: predictive filtering for alpha plane
-exact: preserve RGB values in transparent areas, i.e. do not modify fully transparent pixels
-noalpha: discard the alpha channel
-lossless: compress the image losslessly
-near_lossless: use near-lossless image preprocessing
-hint: specify a hint about the image characteristics
-af: enable auto-filter, which adjusts the filter strength automatically; the metric used is SSIM
How do you convert other image formats to WebP?
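A minimal sketch using the cwebp command line with a few of the flags above, invoked from Python; the quality value and filenames are arbitrary:

```python
import subprocess

# -q sets the quality factor, -mt enables multi-threading,
# -sharp_yuv requests the more accurate RGB->YUV conversion.
subprocess.run([
    "cwebp", "-q", "80", "-mt", "-sharp_yuv",
    "input.jpg", "-o", "output.webp",
])
```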
JPG format YUVJ420: YUV420P with JPEG (full) color range, i.e. pixels use the 0-255 range instead of 16-235, where 0 rather than 16 is black and 255 rather than 235 is white.
Decoder perspective (9.4 Loop Filter Type and Levels). Frame-level loop filter parameters:
1. filter_type: normal or simple; the simple filter is applied only to luma edges.
2. loop_filter_level: defines the threshold; if the difference across an edge is below this threshold it is filtered, otherwise it is left unchanged. The level is usually correlated with the quantizer level.
3. sharpness_level: constant over the frame. If loop_filter_level = 0, the loop filter should be skipped.
Differences in excess of a threshold (associated with the loop_filter_level) are assumed to be real image content and are not modified.
mode_ref_lf_delta_update = 1 allows per-macroblock adjustment through deltas. There are two groups of delta values: one for reference-frame-based adjustment and the other for mode-based adjustment.
Filter header fields written into the bitstream: simple (filter type), level (corresponding to fstrength_), and sharpness.
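To make the threshold idea concrete, here is a rough sketch of the simple filter's edge test in the spirit of RFC 6386; the derivation of edge_limit from loop_filter_level and sharpness_level is omitted, so treat this as an illustration rather than the spec's exact arithmetic:

```python
def should_filter_simple(p1, p0, q0, q1, edge_limit):
    """p1, p0 are the two pixels on one side of the edge; q0, q1 on the other.

    The edge is smoothed only when the step across it is small; a larger
    difference is assumed to be real image content and is left untouched.
    """
    return abs(p0 - q0) * 2 + abs(p1 - q1) // 2 <= edge_limit
```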