Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of language models by incorporating external information. Compared to using a large language model (LLM) on its own, RAG offers several advantages:
More Precise Answers: By retrieving relevant information from a knowledge base, RAG can provide more accurate and contextually appropriate responses.
Reduced Hallucinations: RAG reduces the likelihood of generating false or irrelevant information by grounding its outputs in real-world data.
Access to Latest Information: RAG can utilize up-to-date external sources, ensuring that the generated content is current.
RAG Structure
The typical structure of a RAG system includes:
Vector Database: This stores the knowledge base as vectors for efficient retrieval.
Embedding Model: Converts text from the knowledge base into vector representations.
Large Language Model (LLM): Generates responses based on retrieved context and its own understanding.
RAG Process
The process of RAG involves two main steps:
Indexing: The knowledge base is indexed using an embedding model, converting textual information into vectors for storage in a vector database.
Query and Response Generation: When a query is received, the system retrieves relevant context from the vector database and uses it to generate a response with the help of the LLM.
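As a concrete illustration of both steps, here is a minimal sketch using Chroma (listed under Vector Database below) with its default embedding model; the collection name and documents are made up for the example:

```python
import chromadb  # assumes the chromadb package is installed

# Indexing: store documents in a vector database. Chroma embeds them
# with its default embedding model when no embeddings are supplied.
client = chromadb.Client()
collection = client.create_collection("knowledge_base")  # name is made up
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "RAG retrieves relevant context before generating an answer.",
        "Ollama runs large language models locally.",
    ],
)

# Query: retrieve the most relevant context for the user's question,
# then place it into the prompt that is sent to the LLM.
question = "How does RAG work?"
results = collection.query(query_texts=[question], n_results=1)
context = results["documents"][0][0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```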
Open-Source Solutions
Several open-source frameworks are available for implementing RAG:
LlamaIndex: Connects to various data sources and converts data into vectors using an embedding model.
Haystack
LangChain
GraphRAG: A logic reasoning framework that supports multi-hop fact questions.
KAG: An OpenSPG engine-based framework for building domain-specific knowledge bases.
AnythingLLM: Provides a chat interface to convert documents into context for LLMs.
MaxKB: An open-source knowledge-base question-answering system used in customer service, internal knowledge bases, and education.
RAGFlow: A deep document understanding-based RAG engine that provides reliable Q&A services.
FastGPT: A knowledge-base Q&A system with data-processing and model-calling capabilities.
Langchain-Chatchat: Local knowledge base Q&A based on Langchain and ChatGLM.
FlashRAG: A Python toolset for reproducing and developing RAG research, including 36 preprocessed benchmark datasets and 15 advanced algorithms.
Open WebUI (formerly Ollama WebUI): An extensible, feature-rich, and user-friendly self-hosted web interface designed to run fully offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.
Vector Database
Qdrant: A fast, Rust-based vector database focusing on performance and efficiency.
Chroma: Popular for its simplicity and ease of use.
Embedding Model
Sentence Transformers: A library for generating high-quality sentence embeddings.
BERT (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based model known for its ability to understand context by pre-training on a large corpus of text.
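As a small illustration of the embedding step, here is a sketch using the sentence-transformers library; the model name all-MiniLM-L6-v2 is just a common default, not something prescribed by this post:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model
embeddings = model.encode([
    "RAG grounds answers in retrieved context.",
    "Vector databases store embeddings for fast search.",
])

# Cosine similarity between the two sentence vectors.
cos = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(cos)
```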
Language Models (LLMs)
llama3.3
Gemma2
Cloud Solutions
Cloud-based services that support RAG include:
Google Vertex AI Matching Engine: Provides vector search and a managed RAG solution for enterprise search.
AWS Kendra + Sagemaker/Bedrock: Combines with Kendra for enterprise search and Bedrock for LLMs to build RAG solutions.
Azure AI Search + Azure OpenAI Service: Offers vector search and integrates well with Azure OpenAI Service for building RAG applications.
火山方舟 (Volcengine Ark): A large-model service platform.
腾讯云ES (Tencent Cloud ES): A RAG offering built on the Elasticsearch ecosystem.
阿里云PAI (Alibaba Cloud PAI): Provides a guide for deploying a RAG-based dialogue system.
For local use, first run ollama serve to start the service, then run a model, for example ollama run llama3.3. Ollama will pull llama3.3 first, then run it and present the following prompt:
```
$ ollama run llama3.3
>>> Send a message (/? for help)
```
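Besides the interactive prompt, ollama serve also exposes a local REST API (port 11434 by default), so the model can be called programmatically. A minimal sketch, assuming the service is running and llama3.3 has already been pulled:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    json={"model": "llama3.3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])  # the full generated answer
```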
Different models occupy different amounts of video memory; you can check the currently running models with ollama ps:
```
~$ ollama ps
NAME             ID              SIZE    PROCESSOR          UNTIL
llama3.3:latest  a6eb4748fd29    45 GB   46%/54% CPU/GPU    4 minutes from now
```
Here 54% of the model is running on the GPU and 46% on the CPU.
You can also check which models are available locally with ollama list:
```
~$ ollama list
NAME                        ID              SIZE     MODIFIED
qwen2.5:32b                 9f13ba1299af    19 GB    7 days ago
llava:latest                8dd30f6b0cb1    4.7 GB   10 days ago
codegemma:latest            0c96700aaada    5.0 GB   10 days ago
internlm2:20b               a864ac8dade2    11 GB    10 days ago
internlm2:latest            5050e36678ab    4.5 GB   10 days ago
glm4:latest                 5b699761eca5    5.5 GB   10 days ago
deepseek-coder-v2:latest    63fb193b3a9b    8.9 GB   12 days ago
wizardcoder:latest          de9d848c1323    3.8 GB   12 days ago
```
```
# Secondly, you can find your NVIDIA driver's version with the following command
ls /usr/src | grep nvidia
# 3. Lastly, you should install the corresponding version with the following commands
sudo apt install dkms
sudo dkms install -m nvidia -v xxx
# https://stackoverflow.com/questions/67071101/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-mak
```
--vbv-maxrate: Maximum local bitrate (kbits/sec). Used only if vbv-bufsize is also non-zero. Both vbv-bufsize and vbv-maxrate are required to enable VBV in CRF mode. Default 0 (disabled).
--vbv-bufsize: Specify the size of the VBV buffer (kbits). Enables VBV in ABR mode. In CRF mode, --vbv-maxrate must also be specified. Default 0 (VBV disabled).
--vbv-init: Initial buffer occupancy. The portion of the decode buffer which must be full before the decoder will begin decoding. Determines the absolute maximum frame size. May be specified as a fractional value between 0 and 1, or in kbits; in other words, these two option pairs are equivalent: --vbv-bufsize 1000 --vbv-init 900 and --vbv-bufsize 1000 --vbv-init 0.9.
Note that vbv-bufsize / vbv-maxrate gives the number of seconds of video the buffer holds.
--vbv-end: Final buffer fullness. The portion of the decode buffer that must be full after all the specified frames have been inserted into the decode buffer.
--min-vbv-fullness: Minimum VBV fullness percentage to be maintained. Specified as a fractional value ranging between 0 and 100. Default 50, i.e. tries to keep the buffer at least 50% full at any point in time. Decreasing the minimum required fullness shall improve the compression efficiency, but is expected to affect VBV conformance. Experimental option.
--max-vbv-fullness: Maximum VBV fullness percentage to be maintained. Specified as a fractional value ranging between 0 and 100. Default 80, i.e. tries to keep the buffer at most 80% full at any point in time. Increasing the maximum allowed fullness shall improve the compression efficiency, but is expected to affect VBV conformance. Experimental option.
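Putting the main options together, here is a sketch of enabling VBV in CRF mode, invoked from Python for illustration; the filenames and bitrate numbers are arbitrary:

```python
import subprocess

maxrate_kbit = 5000   # VBV bitrate ceiling (kbit/s)
bufsize_kbit = 10000  # decoder buffer: bufsize/maxrate = 2 seconds of video

# In CRF mode both --vbv-maxrate and --vbv-bufsize are required to enable VBV.
subprocess.run([
    "x265", "--crf", "23",
    "--vbv-maxrate", str(maxrate_kbit),
    "--vbv-bufsize", str(bufsize_kbit),
    "--input", "input.y4m",
    "--output", "out.hevc",
])
```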
DCR (Degradation Category Rating): Each trial evaluates a pair of videos (source, processed); viewers rate the impairment of the processed video relative to the source on the scale (Imperceptible, Perceptible but not annoying, Slightly annoying, Annoying, Very annoying).
PC/CCR (Pair Comparison / Comparison Category Rating): Each trial evaluates a pair of processed videos, comparing the two treatments; viewers rate which one is better on the scale (Much worse, Worse, Slightly worse, The same, Slightly better, Better, Much better).
After rating, MOS (Mean Opinion Score) or DMOS (Differential Mean Opinion Score) is used to aggregate the scores. MOS is typically used with ACR: the higher the score, the better the quality. DMOS is typically used with DCR: the lower the score, the closer the processed video is to the original.
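As a toy illustration of the aggregation step (all scores below are made up): MOS is simply the mean of the opinion scores collected for one condition, and the DCR impairment scores are averaged the same way:

```python
# ACR scores on a 1-5 scale (higher = better quality).
acr_scores = [4, 5, 3, 4, 4]
mos = sum(acr_scores) / len(acr_scores)   # MOS = 4.0

# DCR impairment scores (lower = closer to the source video).
dcr_scores = [2, 1, 2, 3, 2]
dmos = sum(dcr_scores) / len(dcr_scores)  # DMOS = 2.0
```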
-sharpness: filter sharpness; 0 = sharpest, 7 = least sharp
-strong: use strong filter
-sharp_yuv: use a more accurate (sharper) RGB->YUV conversion
-partition_limit
-pass: analysis pass number
-qrange: specify the minimum and maximum QP range
-mt: use multi-threading
-alpha_method: specify the method used for alpha-channel compression
-alpha_filter: predictive filtering for alpha plane
-exact: preserve RGB values in transparent areas, i.e. do not modify fully transparent pixels
-noalpha: discard the alpha channel
-lossless: compress the image losslessly
-near_lossless: use near-lossless image preprocessing
-hint: specify a hint about the image characteristics
-af: enable auto-filter, which adjusts the filter strength automatically; the metric used is SSIM
How do you convert other image formats to WebP?
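A minimal sketch using the cwebp command line with a few of the flags above, invoked from Python; the quality value and filenames are arbitrary:

```python
import subprocess

# -q sets the quality factor, -mt enables multi-threading,
# -sharp_yuv requests the more accurate RGB->YUV conversion.
subprocess.run([
    "cwebp", "-q", "80", "-mt", "-sharp_yuv",
    "input.jpg", "-o", "output.webp",
])
```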
JPG format YUVJ420: YUV420P with JPEG (full) color range, i.e. pixels use the 0-255 range instead of 16-235, where 0 rather than 16 is black and 255 rather than 235 is white.
Decoder perspective (9.4 Loop Filter Type and Levels). Frame-level loop filter parameters:
1. filter_type: normal or simple; the simple filter is applied only to luma edges.
2. loop_filter_level: defines the threshold; if the difference across an edge is below this threshold it is filtered, otherwise it is left unchanged. The level is usually correlated with the quantizer level.
3. sharpness_level: constant over the frame. If loop_filter_level = 0, the loop filter should be skipped.
Differences in excess of a threshold (associated with the loop_filter_level) are assumed to be real image content and are not modified.
mode_ref_lf_delta_update = 1 allows per-macroblock adjustment through deltas. There are two groups of delta values: one for reference-frame-based adjustment and the other for mode-based adjustment.
Filter header fields written into the bitstream: simple (filter type), level (corresponding to fstrength_), and sharpness.
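To make the threshold idea concrete, here is a rough sketch of the simple filter's edge test in the spirit of RFC 6386; the derivation of edge_limit from loop_filter_level and sharpness_level is omitted, so treat this as an illustration rather than the spec's exact arithmetic:

```python
def should_filter_simple(p1, p0, q0, q1, edge_limit):
    """p1, p0 are the two pixels on one side of the edge; q0, q1 on the other.

    The edge is smoothed only when the step across it is small; a larger
    difference is assumed to be real image content and is left untouched.
    """
    return abs(p0 - q0) * 2 + abs(p1 - q1) // 2 <= edge_limit
```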