通过RoaringBitmap实现人群预估功能

2023-06-05 约 4147 字预计阅读 9 分钟

/posts/images/roaring-bitmap-audience-0.png

RoaringBitmap是什么?

RoaringBitmap 是一种数据结构，用于存储压缩大规模非负整数集合，它在存储空间和时间效率上都比传统的位图(Bitmap)¹和哈希表(Hash Table)²更加优越。当时主要是为了解决大规模数据场景下整数集合存储、聚合以及查询运算等问题而发明了RoaringBitmap。

主要思想

我们将 32-bit 的范围 ([0, n)) 划分为 2^16 （即65536）个桶，每一个桶有一个 Container 来存放一个数值的低16位；
在存储和查询数值的时候，我们将一个数值 k 划分为高 16 位(k / 2^16)和低 16 位(k mod 2^16)，取高 16 位找到对应的桶，然后在低 16 位存放在相应的 Container 中；
容器的话， RBM 使用两种容器结构： Array Container 和 Bitmap Container。Array Container 存放稀疏的数据，Bitmap Container 存放稠密的数据。即，若一个 Container 里面的 Integer 数量小于 4096，就用 Short 类型的有序数组来存储值。若大于 4096，就用 Bitmap 来存储值。

RoaringBitmap业务场景中有什么用?

roaringbitmap在实际业务中可以用来存储用户属性标签，根据存储用户的标签通过交集、并集等方法筛选出特定的用户，以达到大规模数据精准快速查找目的，即提升了性能同事又能降低存储空间。

标签人群预估

在传统业务模式下，假如有一张广告用户标签标：

用户ID	标签
1	{上海市,普陀区,iPhone XR,苹果,上班族,男性}
2	{上海市,青浦区,iPhone 14pro,苹果,上班族,男性}
3	{上海市,嘉定区,华为Mate 40,华为,上班族,男性}
….
100000000	{iPhone}

若想要找到使用iPhone XR的用户，就需要根据标签进行检索，找到标签中包含iPhone XR的行，然后将此数据返回给对应的应用处理。

简单的做法是怎么样的

其实就是按照上面的表结构在数据库中创建用户对应的标签表，然后进行数组查询语句。

这样做会有一个比较明显的问题就是在数据量较大且标签值较多的场景下，数据冗余不仅数据量占用更多，而且性能会非常差。

还有一种方案，roaringbitmap方案将上述表拆分成三张表，标签作为主键，包含此标签的用户作为Bitmap存储。如下：

用户表

用户主键ID	用户UID
1	Uuid1
2	Uuid2
N	……

标签表

标签主键ID	标签名	标签值
1	手机型号	{iPhone XR,iPhone 11,iPhone12,iPhone 13…..}
2	手机品牌	{苹果,华为,小米,荣耀….}
3	常住地	{北京,上海,广州,深圳….}
N	……	……

用户标签表

标签ID	标签值	用户ID
1	iPhone XR	{1,3,5,7,123,456}
2	华为	{2,4,6,1234,5677}
3	上海	{6,8,9,12345}
N	……	……

当需要根据查找同时常住地上海和手机使用iPhone XR的用户时，直接在用户标签标对用户ID的bitmap进行查询，能够极大提升性能并且降低存储容量。

测试操作准备

1.创建随机手机品牌的函数。

create or replace function random_phone_brand() returns text as
$$
declare
chars text[] := '{华为,小米,OPPO,vivo,iPhone,三星,魅族,红米,Realme,荣耀,一加,索尼,联想,诺基亚,黑鲨,努比亚,华硕,中兴,LG,谷歌,摩托罗拉,HTC,夏普,蓝牙,金立,360,锤子科技}';
result text := '';
begin
result := result || chars[1+random()*(array_length(chars, 1)-1)];
return result;
end;
$$ language plpgsql;

2.创建随机手机型号的函数

create or replace function random_phone_model() returns text as
$$
declare
chars text[] := '{华为Mate40Pro,小米11Ultra,OPPOFindX3Pro,vivoX60Pro+,一加9Pro,魅族18Pro,iQOO7Legend,红米K40Pro+,realmeGT,荣耀30Pro+,iPhone12ProMax,三星GalaxyS21Ultra,诺基亚X20,黑鲨4Pro,努比亚RedMagic6Pro,联想拯救者,格力手机,华硕ROGPhone5,飞利浦S702H,宏碁PredatorX,夏普AquosR6,中兴Axon30Ultra,美图M8Pro,TCL20Pro5G,LGV60ThinQ5G,海信A7,优派SWORDFISH,锤子坚果R2,vivoiQOO5Pro,小米11Pro,索尼Xperia1III,OPPOReno6Pro+,红魔6Pro,国际版华硕ZenFone8,SONYXperia5III,旗舰版坚果R9,PalmPhone,京东方子弹来了,GrephoneG636,斐讯Note7,酷派S100,台电T1,凯利通XT110,卡布达Plus,小牛手机K5,GretelA7,克莱因U+,蓝魔非凡版,赫维L7,百度DuerOS,文字王,吉比特R10,恒宇世纪M2,尼凯恩手机,艺品手机U5,中恒A7,方正光彩,悦锤W810,北京黑马,炫耀V6,樱花V801,旺百士X1,巨蝎S7,东信TC866,响尾蛇C1,志高M1,飞利浦S807Y,平红手机,学长手机,小鱼手机,老干妈手机,液压手机,奋斗手机,彩虹C1,金刚M111,爱科技M101,K-TouchH999,金立M10,HTCOneX9,摩托罗拉MotoZ,VIVOY66L,OPPOR9s,努比亚Z11Max,小米Mix2,华为P20,三星Note8,联想Z5ProGT,vivoNexS,荣耀Magic2,iPhoneSE2020,魅族Note8,红米Note9,realmeQ3Pro}';
result text := '';
begin
result := result || chars[1+random()*(array_length(chars, 1)-1)];
return result;
end;
$$ language plpgsql;

3.创建随机主要城市的函数

create or replace function random_city() returns text as
$$
declare
chars text[] := {北京, 上海, 广州, 深圳,杭州, 南京, 苏州, 成都, 重庆, 武汉, 西安,郑州, 长沙, 福州, 厦门, 合肥, 南昌, 济南, 青岛, 太原, 沈阳, 大连};
result text := '';
begin
result := result || chars[1+random()*(array_length(chars, 1)-1)];
return result;
end;
$$ language plpgsql;

4.创建生产随机整型数组的函数。

create or replace function random_int_array(int, int)
returns int[] language sql as
$$
select array_agg(round(random()* $1)::int)
from generate_series(1, $2)
$$;

5.创建随机字符串

create or replace function random_string(length integer) returns text as
$$
declare
chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
result text := '';
i integer := 0;
length2 integer := (select trunc(random() * length + 1));
begin
if length2 < 0 then
raise exception 'Given length cannot be less than 0';
end if;
for i in 1..length2 loop
result := result || chars[1+random()*(array_length(chars, 1)-1)];
end loop;
return result;
end;
$$ language plpgsql;

6.创建一个生成随机字符数组的函数

create or replace function random_string_array(int, int)
returns TEXT[] language sql as
$$
select array_agg(random_string($1)) from generate_series(1, $2);
$$;

方案1：传统表方案

一张表解决一切。

1.创建一个表包含所有数据。

create table account(
	id bigint primary KEY,
	wuid varchar,
	tag TEXT []
);

2.模拟插入1000w用户数据（需要使用到场景准备工作中的函数），并且创建 Gin 索引。

insert into account select generate_series(1,100000000), random_string(32), ARRAY[random_phone_brand() , random_phone_model(), random_phone_price(), random_city(),random_string(5),random_string(10)];

create index tag_inx on account USING GIN(tag);

3.执行查询，查找标签带华为和上海的用户列表。

explain analyze select id, wuid from account where tag @>ARRAY['华为','上海'];

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on account  (cost=82.21..9287.03 rows=8620 width=25) (actual time=20.083..28.754 rows=9173 loops=1)
   Recheck Cond: (tag @> '{华为,上海}'::text[])
   Heap Blocks: exact=8973
   ->  Bitmap Index Scan on tag_inx  (cost=0.00..80.05 rows=8620 width=0) (actual time=18.934..18.934 rows=9173 loops=1)
         Index Cond: (tag @> '{华为,上海}'::text[])
 Planning Time: 0.099 ms
 Execution Time: 29.604 ms

4.执行查询，查找标签华为和上海的人有多少个（第一次查询会较慢）。

explain analyze select count(id) from account where tag @>ARRAY['华为','上海'];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=9308.58..9308.59 rows=1 width=8) (actual time=28.611..28.613 rows=1 loops=1)
   ->  Bitmap Heap Scan on account  (cost=82.21..9287.03 rows=8620 width=8) (actual time=19.259..27.603 rows=9173 loops=1)
         Recheck Cond: (tag @> '{华为,上海}'::text[])
         Heap Blocks: exact=8973
         ->  Bitmap Index Scan on tag_inx  (cost=0.00..80.05 rows=8620 width=0) (actual time=18.112..18.112 rows=9173 loops=1)
               Index Cond: (tag @> '{华为,上海}'::text[])
 Planning Time: 0.091 ms
 Execution Time: 28.852 ms

方案2：减少标签数据冗余：优化方案

为了降低查询中标签字段的类型导致的性能减低，所以将上面表中的真实 tag 修改为 tag_id。

1.引入一个新的标签字典表

CREATE TABLE tag_dict (
	id int primary key,
  tag text
  tag_value_id int
  tag_value text
);

2.一共有1W种字典标签类型

insert into tag_dict select generate_series(1,10000), md5(random()::text), random_integer(10), random_string(10);

3.创建一个新表用来存储用户和标签信息

create table account_opt(
	id bigint primary KEY,
	wuid varchar,
	tag INT []
);

4.模拟1000w用户标签数据

insert into account_opt select generate_series(1,10000000), random_string(32),random_int_array(10000, 10);

5.查找同时有标签 ID 为1024和2048的用户列表。

索引前：

explain analyze select id, wuid from account_opt where tag @> ARRAY[1024, 2048];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..198483.98 rows=4 width=25) (actual time=22.696..4496.574 rows=9 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on account_opt  (cost=0.00..197483.58 rows=2 width=25) (actual time=348.374..4476.256 rows=3 loops=3)
         Filter: (tag @> '{1024,2048}'::integer[])
         Rows Removed by Filter: 3333330
 Planning Time: 0.075 ms
 Execution Time: 4496.600 ms

加索引：

create index tag_inx_opt on account_opt USING GIN(tag);

索引后：

explain analyze select id, wuid from account_opt where tag @> ARRAY[1024, 2048];
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on account_opt  (cost=6.63..11.08 rows=4 width=25) (actual time=0.798..0.817 rows=9 loops=1)
   Recheck Cond: (tag @> '{1024,2048}'::integer[])
   Heap Blocks: exact=9
   ->  Bitmap Index Scan on tag_inx_opt  (cost=0.00..6.63 rows=4 width=0) (actual time=0.790..0.791 rows=9 loops=1)
         Index Cond: (tag @> '{1024,2048}'::integer[])
 Planning Time: 0.086 ms
 Execution Time: 0.836 ms

6.查找标签人群数量（标签1024和2048）的人有 xx 个。

explain analyze select count(id) from account_opt where tag @> ARRAY[1024, 2048];
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=11.09..11.10 rows=1 width=8) (actual time=0.793..0.794 rows=1 loops=1)
   ->  Bitmap Heap Scan on account_opt  (cost=6.63..11.08 rows=4 width=8) (actual time=0.778..0.788 rows=9 loops=1)
         Recheck Cond: (tag @> '{1024,2048}'::integer[])
         Heap Blocks: exact=9
         ->  Bitmap Index Scan on tag_inx_opt  (cost=0.00..6.63 rows=4 width=0) (actual time=0.771..0.772 rows=9 loops=1)
               Index Cond: (tag @> '{1024,2048}'::integer[])
 Planning Time: 0.091 ms
 Execution Time: 0.820 ms

方案3：Roaringbitmap方案

1.创建标签用户对应表

create table tag_wuid_list(
	tag_id int primary key,
	tag_name text,
	wuid_offset int,
	wuid_bits roaringbitmap
);

2.根据之前的标签表插入1W条标签以及标签对应的用户数据。

insert into tag_wuid_list
select 
	tag_id, 
	random_string(5),
	wuid_offset, 
	rb_build_agg(id::int) as uinbits 			-- 将Offset聚合成bitmap
from (
	select
		unnest(tag) as tag_id, 							-- 以将数组展开成一列，每个数组元素成为一行数据
		(id / (2^31)::int8) as wuid_offset, -- 取余
		mod(id, (2^31)::int8) as id				  -- 取模
		from account_opt
) t
group by tag_id, wuid_offset;

3.查询标签有7，8，9，10的用户个数

explain analyze select sum(wb) from
(
  select wuid_offset, rb_and_cardinality_agg(wuid_bits) as wb
  from tag_wuid_list
  where tag_id in (1024,2048)
  group by wuid_offset
) t;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3.96..3.97 rows=1 width=32) (actual time=0.189..0.190 rows=1 loops=1)
   ->  GroupAggregate  (cost=3.92..3.94 rows=1 width=12) (actual time=0.186..0.187 rows=1 loops=1)
         Group Key: tag_wuid_list.wuid_offset
         ->  Sort  (cost=3.92..3.92 rows=2 width=22) (actual time=0.029..0.029 rows=2 loops=1)
               Sort Key: tag_wuid_list.wuid_offset
               Sort Method: quicksort  Memory: 25kB
               ->  Index Scan using tag_wuid_list_pkey on tag_wuid_list  (cost=0.29..3.91 rows=2 width=22) (actual time=0.018..0.023 rows=2 loops=1)
                     Index Cond: (tag_id = ANY ('{1024,2048}'::integer[]))
 Planning Time: 0.103 ms
 Execution Time: 0.220 ms

4.查询标签有7，8，9，10的用户列表

explain analyze select wuid_offset,rb_or_agg(wuid_bits) as wb
from tag_wuid_list
where tag_id in (7,8,9,10)
group by wuid_offset;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on account_opt  (cost=6.63..11.08 rows=4 width=25) (actual time=0.771..0.781 rows=9 loops=1)
   Recheck Cond: (tag @> '{1024,2048}'::integer[])
   Heap Blocks: exact=9
   ->  Bitmap Index Scan on tag_inx_opt  (cost=0.00..6.63 rows=4 width=0) (actual time=0.764..0.764 rows=9 loops=1)
         Index Cond: (tag @> '{1024,2048}'::integer[])
 Planning Time: 0.082 ms
 Execution Time: 0.802 ms

查看索引以及表占用大小

select relname, pg_size_pretty(pg_relation_size(relid)) from pg_stat_user_tables where schemaname='public' order by pg_relation_size(relid) desc;

          relname           | pg_size_pretty
----------------------------+----------------
 account                    | 1494 MB
 account_opt                | 1136 MB
 tag_wuid_list              | 624 kB

select indexrelname, pg_size_pretty(pg_relation_size(relid)) from pg_stat_user_indexes where schemaname='public' order by pg_relation_size(relid) desc;

    indexrelname    | pg_size_pretty
--------------------+----------------
 account_pkey       | 1494 MB
 tag_inx            | 1494 MB
 account_opt_pkey   | 1136 MB
 tag_inx_opt        | 1136 MB
 tag_wuid_list_pkey | 624 kB

Bitmap聚合函数列表³

Bitmap聚合函数主要包括的函数及说明如下表所示：

函数名	输入数据类型	输出数据类型	描述	示例
rb_build_agg	integer	roaringbitmap	将Offset聚合成bitmap。	`rb_build_agg(1)`
rb_or_agg	roaringbitmap	roaringbitmap	Or聚合计算。	`rb_or_agg(rb_build('{1,2,3}'))`
rb_and_agg	roaringbitmap	roaringbitmap	And聚合计算。	`rb_and_agg(rb_build('{1,2,3}'))`
rb_xor_agg	roaringbitmap	roaringbitmap	Xor聚合计算。	`rb_xor_agg(rb_build('{1,2,3}'))`
rb_or_cardinality_agg	roaringbitmap	integer	Or聚合计算并返回其基数。	`rb_or_cardinality_agg(rb_build('{1,2,3}'))`
rb_and_cardinality_agg	roaringbitmap	integer	And聚合计算并返回其基数。	`rb_and_cardinality_agg(rb_build('{1,2,3}'))`
rb_xor_cardinality_agg	roaringbitmap	integer	Xor聚合计算并返回其基数。	`rb_xor_cardinality_agg(rb_build('{1,2,3}'))`

结论

不同方案的查询性能对比如下表：

查询项	传统表方案	传统表方案-优化版	roaringbitmap 方案
查询共同标签的用户个数	28.852 ms	0.820 ms	0.220 ms
查询包含指定标签的用户列表	29.604 ms	0.836 ms	0.802 ms
数据容量统计	4482 MB	3408 MB	1250 KB

基于上述三种方案可以明显看到，优化后效果非常明显，无论是容量还是性能都强于传统方案，roaringbitmap 方案整体上来看，无论是查询耗时还是数据容量占用都有很好的性能和效果。

Daniel Lemire是一位加拿大计算机科学家，他是蒙特利尔大学计算机科学系的教授。 ↩︎
它基于位运算和位装置的原理，可以用于优化和压缩数据，减少内存使用，加速计算速度，提高算法效率等方面。是一种数据结构，用于对大量数据的存在状态进行显式处理。 ↩︎
Roaring Bitmap函数 https://help.aliyun.com/apsara/agile/v_3_6_0_20210705/hologram/ase-user-guide/roaring-bitmap-function.html ↩︎