SQL Query Optimization: Fixing the Weekly Driver Mileage Threshold Filter

Excluding weekly totals that miss the threshold from a database query

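This one looks familiar: "Subquery returned more than 1 value" is an error that comes up often in SQL queries. The core of the problem is that you want to keep only the data for drivers whose weekly mileage exceeds 1000 miles, but the WHERE clause uses a subquery that returns more than one value, which triggers the error.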

Cause of the problem

The error comes from the subquery in the WHERE clause. This subquery:

(SELECT SUM(TotalMiles)
 FROM #TEMP4
 GROUP BY Driver, DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1
 HAVING SUM(TotalMiles) > 1000)

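computes the total miles for each driver and week and keeps only the totals above 1000. A comparison in a WHERE clause, however, must evaluate to true or false for each individual row, so a subquery used as an operand has to return a single value. Because of the GROUP BY, this subquery returns one row per qualifying driver and week, that is, a set of values (HAVING only narrows that set). SQL Server cannot compare one row against a set; it would be like comparing Zhang San against every student in a school at once. That is what the error is reporting.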
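To see why the comparison fails, it can help to run the grouped subquery on its own and count the rows it returns. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server; the trips table, its columns, and the sample data are made up for illustration. SQLite silently takes the first row of a multi-row scalar subquery where SQL Server raises the error, so only the row count is demonstrated here.

```python
import sqlite3

# Hypothetical sample data: (driver, week_start, total_miles).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (driver TEXT, week_start TEXT, total_miles INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("Ann", "2024-01-07", 600),
        ("Ann", "2024-01-07", 500),
        ("Ann", "2024-01-14", 300),
        ("Bob", "2024-01-07", 400),
        ("Bob", "2024-01-07", 450),
        ("Bob", "2024-01-14", 1200),
    ],
)

# Same shape as the failing subquery: one output row per qualifying group.
grouped_totals = conn.execute(
    """
    SELECT SUM(total_miles)
    FROM trips
    GROUP BY driver, week_start
    HAVING SUM(total_miles) > 1000
    """
).fetchall()

# More than one row comes back, so there is no single value to compare against.
print(len(grouped_totals), "rows returned by the subquery")
```

With this sample data two driver-weeks qualify, so the subquery yields two rows rather than one value.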

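Solutions

The key is to apply the condition "weekly total above 1000" to every individual row. Here are several ways to do that.

Option 1: Use a window function (recommended)

Window functions are a good fit for this kind of problem. They compute an aggregate (such as the weekly total mileage) without collapsing the rows, and attach that value to every row.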

WITH WeeklyTotals AS (
    SELECT
        InvoiceNumber,
        Dataflow,
        BillTo,
        ShipDate,
        cy.cmp_name AS ShipperName,
        c.cty_name AS OriginCity,
        OriginState,
        c.cty_zip AS OriginZip,
        c.cty_latitude AS OriginLatitude,
        c.cty_longitude AS OriginLongitude,
        c.cty_region2 AS OriginRegion,
        cy1.cmp_name AS ConsigneeName,
        c1.cty_name AS DestCity,
        DestState,
        c1.cty_zip AS DestZip,
        c1.cty_latitude AS DestLatitude,
        c1.cty_longitude AS DestLongitude,
        c1.cty_region2 AS DestRegion,
        c.cty_region2 + ' || ' + c1.cty_region2 AS Lane,
        Driver,
        Delivery,
        WeekStart,
        SUM(Vol) AS Vol,
        SUM(IBVol) AS IBVol,
        SUM(OBVol) AS OBVol,
        CASE WHEN Dataflow = 'Outbound' THEN SUM(Match) ELSE 0 END AS Match,
        SUM(Weight) AS Weight,
        SUM(Tons) AS Tons,
        SUM(TotalMiles) AS TotalMiles,
        SUM(Miles) AS Miles,
        SUM(LDMiles) AS LDMiles,
        SUM(MTMiles) AS MTMiles,
        SUM(LDMiles) + SUM(MTMiles) AS [Total Miles],
        SUM(MinCharge) AS MinCharge,
        SUM(Pickups) AS Pickups,
        SUM(Linehaul) AS Linehaul,
        SUM(FSC) AS FSC,
        SUM(Fixed) AS Fixed,
        SUM(Tractors) AS Tractors,
        SUM([Tractor Cost]) AS [Tractor Cost],
        SUM(Trailers) AS Trailers,
        SUM([Trailer Cost]) AS [Trailer Cost],
        SUM(Tractors) AS Drivers,
        SUM(SUM(TotalMiles)) OVER (PARTITION BY Driver, DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1) AS WeeklyDriverMiles -- window over the grouped rows: weekly total per driver
    FROM #temp3
    INNER JOIN company cy (NOLOCK) ON Shipper = cy.cmp_id
    INNER JOIN company cy1 (NOLOCK) ON Consignee = cy1.cmp_id
    INNER JOIN city c (NOLOCK) ON origincity = c.cty_code
    INNER JOIN city c1 (NOLOCK) ON DestCity = c1.cty_code
    GROUP BY WeekStart, DataFlow, Delivery, ShipDate, BillTo, Driver, Shipper, OriginCity, OriginState, Consignee, DestCity, DestState, InvoiceNumber, c.cty_latitude, c.cty_longitude,
             cy.cmp_name, cy1.cmp_name, c1.cty_name, c.cty_name, c1.cty_latitude, c1.cty_longitude, c1.cty_zip, c.cty_zip, c1.cty_region2, c.cty_region2
)
SELECT *
FROM WeeklyTotals
WHERE WeeklyDriverMiles > 1000;

Code explanation:

  1. WITH WeeklyTotals AS (...) : This defines a common table expression (CTE) named WeeklyTotals. A CTE behaves like a temporary table that exists only for the current query.
  2. SUM(SUM(TotalMiles)) OVER (PARTITION BY Driver, DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1) AS WeeklyDriverMiles : This is the key part.
    • SUM(TotalMiles) (inner): the ordinary aggregate, computed once per GROUP BY group.
    • SUM(...) OVER (...) (outer): a window function that adds up those group totals. Because the query has a GROUP BY, a plain SUM(TotalMiles) OVER (...) would be rejected: TotalMiles is neither grouped nor aggregated at the point where window functions are evaluated.
    • PARTITION BY Driver, DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1: This tells SQL Server how to partition the rows.
    • Here the partitions are formed by Driver and by the start date of the week (computed with DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1). The sum is computed separately for each driver and week start.
    • As a result, the WeeklyDriverMiles column holds each driver's total mileage for that week.
  3. SELECT * FROM WeeklyTotals WHERE WeeklyDriverMiles > 1000 : Selects from the WeeklyTotals CTE and keeps only the rows whose WeeklyDriverMiles is greater than 1000.
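The week-start expression DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1 is terse, so here is a small Python sketch of the arithmetic it performs. It relies on two documented SQL Server behaviours: datetime 0 is 1900-01-01 (a Monday), and DATEDIFF(wk, ...) counts Sunday boundaries. The function names are illustrative, and the sketch assumes dates on or after 1900-01-01.

```python
from datetime import date, timedelta

SQL_DAY_ZERO = date(1900, 1, 1)         # T-SQL datetime 0, a Monday
REFERENCE_SUNDAY = date(1899, 12, 31)   # the Sunday just before day zero


def week_boundaries_since_day_zero(d: date) -> int:
    """Mimic DATEDIFF(wk, 0, d): count the Sundays after 1900-01-01 up to and including d."""
    return (d - REFERENCE_SUNDAY).days // 7


def week_start(d: date) -> date:
    """Mimic DATEADD(wk, DATEDIFF(wk, 0, d), 0) - 1."""
    weeks = week_boundaries_since_day_zero(d)
    monday = SQL_DAY_ZERO + timedelta(weeks=weeks)  # DATEADD(wk, n, 0) lands on a Monday
    return monday - timedelta(days=1)               # "- 1" steps back to the Sunday


for sample in (date(2024, 1, 7), date(2024, 1, 10), date(2024, 1, 13), date(2024, 1, 14)):
    print(sample, "->", week_start(sample))
```

Under these assumptions every date maps to the Sunday on or before it, so the partitions in the query are Sunday-to-Saturday weeks.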

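Principle:

Unlike GROUP BY, a window function does not collapse the rows into one row per group when it computes an aggregate such as SUM. It attaches the aggregate value to every row. A window function cannot appear directly in a WHERE clause, though, which is why the value is computed inside the CTE and the filter is applied in the outer query.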
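The difference between collapsing and annotating rows is easy to see on a small table. The following sketch again uses Python's sqlite3 as a stand-in (window functions need SQLite 3.25 or later); the trips table and its data are hypothetical, and the week start is stored as a ready-made column to keep the SQL short.

```python
import sqlite3

# Window functions were added to SQLite in 3.25.
assert sqlite3.sqlite_version_info >= (3, 25, 0), "SQLite 3.25+ required"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (invoice INTEGER, driver TEXT, week_start TEXT, total_miles INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "2024-01-07", 600),
        (2, "Ann", "2024-01-07", 500),
        (3, "Ann", "2024-01-14", 300),
        (4, "Bob", "2024-01-07", 400),
        (5, "Bob", "2024-01-07", 450),
        (6, "Bob", "2024-01-14", 1200),
    ],
)

# GROUP BY collapses the six trips into one row per driver and week.
grouped_rows = conn.execute(
    "SELECT driver, week_start, SUM(total_miles) FROM trips GROUP BY driver, week_start"
).fetchall()

# The window function keeps every trip and attaches the weekly total to it;
# the filter has to live in the outer query, not next to the window function.
windowed_rows = conn.execute(
    """
    WITH weekly AS (
        SELECT invoice, driver,
               SUM(total_miles) OVER (PARTITION BY driver, week_start) AS weekly_driver_miles
        FROM trips
    )
    SELECT invoice, driver, weekly_driver_miles
    FROM weekly
    WHERE weekly_driver_miles > 1000
    ORDER BY invoice
    """
).fetchall()

print(len(grouped_rows), "grouped rows")
print(windowed_rows)
```

The grouped query loses the individual trips, while the windowed query returns the original trip rows for the driver-weeks that pass the threshold.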

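Safety note:

Use NOLOCK with care, especially on tables with heavy write activity. It allows dirty reads (reading uncommitted data), and rows can also be skipped or read twice while other sessions modify the table. If data consistency matters, remove (NOLOCK). The hint is not needed for this fix; it is kept here only because the original query used it, and whether the risk is acceptable depends on how the report is used.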

Option 2: Use a JOIN

This approach first finds the (driver, week start) pairs whose weekly total exceeds 1000, then joins them back to the original table.

WITH DriverWeeklyTotals AS (
    SELECT
        Driver,
        DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1 AS WeekStart
    FROM #temp3
    GROUP BY Driver, DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1
    HAVING SUM(TotalMiles) > 1000
)
SELECT
    t4.*
FROM #temp4 t4
INNER JOIN DriverWeeklyTotals dwt ON t4.Driver = dwt.Driver AND DATEADD(wk, DATEDIFF(wk, 0, t4.shipdate), 0) - 1 = dwt.WeekStart;

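Code explanation:

  • First build DriverWeeklyTotals: group by Driver and WeekStart, and use HAVING to keep only the weeks whose total exceeds 1000.
  • Then join #temp4 to DriverWeeklyTotals on both Driver and WeekStart. The result is the #temp4 rows that belong to a driver-week with more than 1000 miles driven.

Principle:

First compute the qualifying (driver, week start) combinations, then JOIN them back to the original data so that only rows matching one of those combinations remain.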
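The two-step idea (find the qualifying pairs, then join back) can be tried on toy data. This is a sketch using Python's sqlite3 as a stand-in for SQL Server; the trips table, its columns, and the values are assumptions made for the example, with the week start stored as a plain column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (invoice INTEGER, driver TEXT, week_start TEXT, total_miles INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "2024-01-07", 600),
        (2, "Ann", "2024-01-07", 500),
        (3, "Ann", "2024-01-14", 300),
        (4, "Bob", "2024-01-07", 400),
        (5, "Bob", "2024-01-07", 450),
        (6, "Bob", "2024-01-14", 1200),
    ],
)

joined_rows = conn.execute(
    """
    WITH driver_weekly_totals AS (
        -- Step 1: the (driver, week) pairs whose total exceeds the threshold.
        SELECT driver, week_start
        FROM trips
        GROUP BY driver, week_start
        HAVING SUM(total_miles) > 1000
    )
    -- Step 2: keep only the detail rows that belong to one of those pairs.
    SELECT t.invoice, t.driver, t.week_start, t.total_miles
    FROM trips AS t
    INNER JOIN driver_weekly_totals AS dwt
            ON t.driver = dwt.driver
           AND t.week_start = dwt.week_start
    ORDER BY t.invoice
    """
).fetchall()

for row in joined_rows:
    print(row)
```

Because each (driver, week) pair appears at most once in the CTE, the join cannot duplicate detail rows.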

Option 3: Use EXISTS

This approach uses an EXISTS subquery to check, for each row, whether a matching group that satisfies the condition exists.

SELECT
    t4.*
FROM #temp4 t4
WHERE EXISTS (
    SELECT 1
    FROM #temp3 t3
    WHERE t3.Driver = t4.Driver
      AND DATEADD(wk, DATEDIFF(wk, 0, t3.shipdate), 0) - 1 = DATEADD(wk, DATEDIFF(wk, 0, t4.shipdate), 0) - 1
    GROUP BY Driver, DATEADD(wk, DATEDIFF(wk, 0, t3.shipdate), 0) - 1
    HAVING SUM(t3.TotalMiles) > 1000
);

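Code explanation:

  • For each row of #temp4, the subquery after EXISTS is evaluated.
  • Inside the subquery, the WHERE clause first matches the #temp3 rows with the same Driver and week start, then groups them by Driver and week start and checks whether SUM(t3.TotalMiles) for that group exceeds 1000.
  • If it does, the EXISTS subquery returns a row and the outer SELECT keeps the current row; otherwise the row is left out.

Principle:
Same idea as the JOIN, expressed with EXISTS. EXISTS is sometimes faster than a JOIN because the database can stop at the first matching row, but here the subquery has to aggregate the whole driver-week before HAVING can be tested, so that advantage is limited. Compare the execution plans rather than assuming either is faster.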
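A correlated EXISTS of this shape can be checked on a small data set. The sketch below is an illustration in Python's sqlite3 rather than SQL Server; the trips table, the aliases t and w, and the data are hypothetical, and the week start is stored as a column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (invoice INTEGER, driver TEXT, week_start TEXT, total_miles INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "2024-01-07", 600),
        (2, "Ann", "2024-01-07", 500),
        (3, "Ann", "2024-01-14", 300),
        (4, "Bob", "2024-01-07", 400),
        (5, "Bob", "2024-01-07", 450),
        (6, "Bob", "2024-01-14", 1200),
    ],
)

kept_invoices = [
    row[0]
    for row in conn.execute(
        """
        SELECT t.invoice
        FROM trips AS t
        WHERE EXISTS (
            -- Correlated on the outer row's driver and week; yields a row
            -- only when that driver-week adds up to more than 1000 miles.
            SELECT 1
            FROM trips AS w
            WHERE w.driver = t.driver
              AND w.week_start = t.week_start
            GROUP BY w.driver, w.week_start
            HAVING SUM(w.total_miles) > 1000
        )
        ORDER BY t.invoice
        """
    )
]

print(kept_invoices)
```

Note that the inner query still sums every trip of the driver-week before HAVING is evaluated, which is why the early-exit benefit of EXISTS is small in this pattern.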

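Option 4 (optional, based on the existing code structure): build a staging table first, then insert

Since you already have #temp4, the weekly-total calculation and the filter can be moved into the step that populates #temp4.

-- First create a temp table that includes the weekly total mileage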
SELECT
    WeekStart,
    DataFlow,
    Delivery,
    ShipDate,
    BillTo,
    Driver,
    Shipper,
    OriginCity,
    OriginState,
    Consignee,
    DestCity,
    DestState,
    InvoiceNumber,
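    -- aliases keep the column names unique, which SELECT ... INTO requires
    c.cty_latitude AS OriginLatitude,
    c.cty_longitude AS OriginLongitude,
    cy.cmp_name AS ShipperName,
    cy1.cmp_name AS ConsigneeName,
    c1.cty_name AS DestCityName,
    c.cty_name AS OriginCityName,
    c1.cty_latitude AS DestLatitude,
    c1.cty_longitude AS DestLongitude,
    c1.cty_zip AS DestZip,
    c.cty_zip AS OriginZip,
    c1.cty_region2 AS DestRegion,
    c.cty_region2 AS OriginRegion,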
    SUM(Vol) AS Vol,
    SUM(IBVol) AS IBVol,
    SUM(OBVol) AS OBVol,
    CASE WHEN Dataflow = 'Outbound' THEN SUM(Match) ELSE 0 END AS Match,
    SUM(Weight) AS Weight,
    SUM(Tons) AS Tons,
    SUM(TotalMiles) AS TotalMiles,
    SUM(Miles) AS Miles,
    SUM(LDMiles) AS LDMiles,
    SUM(MTMiles) AS MTMiles,
    SUM(LDMiles) + SUM(MTMiles) AS [Total Miles],
    SUM(MinCharge) AS MinCharge,
    SUM(Pickups) AS Pickups,
    SUM(Linehaul) AS Linehaul,
    SUM(FSC) AS FSC,
    SUM(Fixed) AS Fixed,
    SUM(Tractors) AS Tractors,
    SUM([Tractor Cost]) AS [Tractor Cost],
    SUM(Trailers) AS Trailers,
    SUM([Trailer Cost]) AS [Trailer Cost],
    SUM(Tractors) AS Drivers,
    WeeklyDriverMiles -- weekly total, computed by the window function in the derived table below
INTO #temp4_with_weekly
FROM (
    SELECT *,
           SUM(TotalMiles) OVER (PARTITION BY Driver, DATEADD(wk, DATEDIFF(wk, 0, shipdate), 0) - 1) AS WeeklyDriverMiles
    FROM #temp3
) AS Subquery

INNER JOIN company cy (NOLOCK) ON Shipper = cy.cmp_id
INNER JOIN company cy1 (NOLOCK) ON Consignee = cy1.cmp_id
INNER JOIN city c (NOLOCK) ON origincity = c.cty_code
INNER JOIN city c1 (NOLOCK) ON DestCity = c1.cty_code

GROUP BY WeekStart, DataFlow, Delivery, ShipDate, BillTo, Driver, Shipper, OriginCity, OriginState, Consignee, DestCity, DestState, InvoiceNumber, c.cty_latitude, c.cty_longitude,
         cy.cmp_name, cy1.cmp_name, c1.cty_name, c.cty_name, c1.cty_latitude, c1.cty_longitude, c1.cty_zip, c.cty_zip, c1.cty_region2, c.cty_region2, WeeklyDriverMiles;

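-- Then filter the rows into the final #temp4 table, dropping any earlier copy first
IF OBJECT_ID('tempdb..#temp4') IS NOT NULL
    DROP TABLE #temp4;

SELECT *
INTO #temp4
FROM #temp4_with_weekly
WHERE WeeklyDriverMiles > 1000;

-- Finally, query #temp4
SELECT *
FROM #temp4;

Code explanation

  1. WeeklyDriverMiles is computed while #temp4_with_weekly is being created.
  2. #temp4 is then populated with WHERE WeeklyDriverMiles > 1000.
  3. The final table is queried.

Principle: the filter becomes part of the temp-table pipeline, so no subquery is needed in a WHERE clause. SELECT ... INTO creates #temp4, so an existing #temp4 has to be dropped first, which the script does.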
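The staged flow (materialize the weekly total, then filter into the final table) can be sketched in miniature. This uses Python's sqlite3 with TEMP tables as a rough analogue of SQL Server's # tables; the table names, columns, and data are illustrative assumptions, and CREATE TABLE ... AS SELECT plays the role of SELECT ... INTO. Window functions need SQLite 3.25 or later.

```python
import sqlite3

assert sqlite3.sqlite_version_info >= (3, 25, 0), "SQLite 3.25+ required"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (invoice INTEGER, driver TEXT, week_start TEXT, total_miles INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "2024-01-07", 600),
        (2, "Ann", "2024-01-07", 500),
        (3, "Ann", "2024-01-14", 300),
        (4, "Bob", "2024-01-07", 400),
        (5, "Bob", "2024-01-07", 450),
        (6, "Bob", "2024-01-14", 1200),
    ],
)

conn.executescript(
    """
    -- Stage 1: materialize every row together with its weekly total.
    DROP TABLE IF EXISTS temp4_with_weekly;
    CREATE TEMP TABLE temp4_with_weekly AS
    SELECT invoice, driver, week_start, total_miles,
           SUM(total_miles) OVER (PARTITION BY driver, week_start) AS weekly_driver_miles
    FROM trips;

    -- Stage 2: rebuild the final table from the rows that pass the threshold.
    DROP TABLE IF EXISTS temp4;
    CREATE TEMP TABLE temp4 AS
    SELECT *
    FROM temp4_with_weekly
    WHERE weekly_driver_miles > 1000;
    """
)

final_rows = conn.execute(
    "SELECT invoice, weekly_driver_miles FROM temp4 ORDER BY invoice"
).fetchall()
print(final_rows)
```

Dropping each table before recreating it makes the script safe to rerun in the same session, which is the same concern the T-SQL version handles before SELECT ... INTO.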

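Summary and recommendations:

  • Prefer the window function (Option 1): it is the most compact, and the weekly total is computed in the same query that builds the result.
  • JOIN (Option 2) and EXISTS (Option 3) also work: the logic is arguably more explicit, but both read the source data a second time, so they may be slower than the window function.
  • Option 4 fits an existing temp-table workflow: the filter becomes a separate step, at the cost of an extra intermediate table.

Pick the option that suits your situation and preferences. It is well worth testing the performance of the different options, especially on large data volumes.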