MongoDB 聚合优化实战

实际需求

实际项目中，可能遇到一些非常复杂的情况

比如：按 siteCode 字段分组，其中根据评分 score 字段由高到低排序，按 20%、30%、50%的形式分成 3 组，计算里面不同 vertical 的数量

解决方案

第一步、获取所有 siteCode 个数

db.getCollection('articleCollection').aggregate([
    {
        "$match": {
            "type": {
                "$ne": "随便写的"
            },
            "vertical": {
                "$ne": "随便写的"
            },
            "score": {
                "$gte": 0
            },
            "publishTime": {
                "$gte": 1553616000000,
                "$lte": 1553702400000
            },
            "areaStr": "湖南省,长沙市"
        }
    },
    {
        "$group": {
            "_id": {
                "siteCode": "$siteCode"
            },
            "count": {
                "$sum": 1
            }
        }
    }
])

拿到了每个 siteCode 的个数

第二步、计算每个 siteCode 的高中低各多少个

const v1 = 0.2;
const v2 = 0.5;

const n1 = Math.floor(number * v1);
const n2 = Math.floor(number * v2) - n1;
const n3 = number - n1 - n2;

number 是每种 siteCode 的总数

第三步、每一种高中低的分组个数

const r1 = n1 > 0 ? await this.getData({ ...querys, siteCode: key, limit: n1 }) : 0;
const r2 = n2 > 0 ? await this.getData({ ...querys, siteCode: key, limit: n2, skip: n1 }) : 0;
const r3 = n3 > 0 ? await this.getData({ ...querys, siteCode: key, skip: n1 + n2 }) : 0;

第四步、getData 方法的具体实现

    /**
     * 根据条件查询数据库
     * @param {obj} param0 传入对象
     * @param {number} param0.score_startAt 评分起始
     * @param {number} param0.score_endAt 评分结束
     * @param {string} param0.publishTime_startBy 发布时间起始
     * @param {string} param0.publishTime_endBy 发布时间结束
     * @param {string} param0.areaStr 地区省市
     * @param {string} param0.siteCode 来源代码
     * @param {obj} param0._id 分组key
     * @return {array} 查询结果
     */
    async getData({
        score_startAt,
        score_endAt,
        publishTime_startBy,
        publishTime_endBy,
        areaStr,
        siteCode,
        limit,
        skip,
        _id = {
            type: '$type',
            vertical: '$vertical'
        }
    }) {
        const searchList = [];

        const match = {
            $match: {
                type: { $ne: '随便写的' },
                vertical: { $ne: '随便写的' }
            }
        };

        if (score_startAt) {
            const key = 'score';
            const value = score_startAt;

            if (match.$match[key]) {
                match.$match[key].$gte = value;
            } else {
                match.$match[key] = { $gte: value };
            }
        }

        if (score_endAt) {
            const key = 'score';
            const value = score_endAt;

            if (match.$match[key]) {
                match.$match[key].$lt = value;
            } else {
                match.$match[key] = { $lt: value };
            }
        }

        // 如果没有传入评分起始结束值，评分条件为大于0，目的触发索引
        if (!score_startAt && !score_endAt) {
            const key = 'score';
            const value = 0;

            if (match.$match[key]) {
                match.$match[key].$gte = value;
            } else {
                match.$match[key] = { $gte: value };
            }
        }

        if (publishTime_startBy) {
            const key = 'publishTime';
            const value = moment(publishTime_startBy, 'YYYY-MM-DD HH:mm:ss').valueOf();

            if (match.$match[key]) {
                match.$match[key].$gte = value;
            } else {
                match.$match[key] = { $gte: value };
            }
        }

        if (publishTime_endBy) {
            const key = 'publishTime';
            const value = moment(publishTime_endBy, 'YYYY-MM-DD HH:mm:ss').valueOf();

            if (match.$match[key]) {
                match.$match[key].$lte = value;
            } else {
                match.$match[key] = { $lte: value };
            }
        }

        if (areaStr) {
            match.$match.areaStr = areaStr;
        }

        if (siteCode) {
            match.$match.siteCode = siteCode;
        }

        searchList.push(match);

        if (skip) {
            const $skip = { $skip: skip };
            searchList.push($skip);
        }

        if (limit) {
            const $limit = { $limit: limit };
            searchList.push($limit);
        }

        const group = {
            $group: {
                _id,
                count: { $sum: 1 }
            }
        };

        searchList.push(group);

        console.log(JSON.stringify(searchList));

        const res = await this.ctx.model.Stat.aggregate(searchList);
        return res;
    }

遇到的问题

1、数值不对

上面的 getData 是最后改好的，一开始 $skip 在 $limit 的下面

        ...
        if (limit) {
            const $limit = { $limit: limit };
            searchList.push($limit);
        }

        if (skip) {
            const $skip = { $skip: skip };
            searchList.push($skip);
        }
        ...

结果返回长度会是 $limit - $skip, 所以应该先 $skip 在 $limit

2、查询速度慢

根据需求，查询数据是要按 score 排序的

经试验，$sort 在 $match 之后就不能触发索引

而 $sort 放在 $group 之后，不再影响索引，但是 $sort 也没有用了，返回数据是没哟排序的

最后的解决方案是把索引中 score 放到索引中的第一项，这样返回数据默认就是按评分 score 排序的

{
    "score": -1,
    "publishTime": -1,
    "areaStr": -1,
    "siteCode": -1,
    "type": -1,
    "vertical": -1,
    "mediaAccountId": -1
}

这样的话，查询条件里就一定要有 score 字段，这样才能触发索引

// 如果没有传入评分起始结束值，评分条件为大于0，目的触发索引
if (!score_startAt && !score_endAt) {
    const key = 'score';
    const value = 0;

    if (match.$match[key]) {
        match.$match[key].$gte = value;
    } else {
        match.$match[key] = { $gte: value };
    }
}

最后

其实该实现方法存在问题，就是每个 siteCode 都要聚合 3 次，目前有十几个 siteCode，加起来就要聚合几十次!

就算每次聚合时间只有几百毫秒，加起来也可能比较长

幸好现在只有十几万条数据，最长需要 20s，最短 6、7s

最理想的方法是能通过 1 个 SQL 就把结果返回来，不过有点难实现

后续再想办法优化

实际需求 #

解决方案 #

第一步、获取 所有 siteCode 个数 #

第二步、计算 每个 siteCode 的 高中低 各 多少个 #

第三步、每一种 高中低 的 分组个数 #

第四步、getData 方法的 具体实现 #

遇到的问题 #

1、数值不对 #

2、查询速度慢 #

最后 #