Solusi untuk menetapkan nilai unik ke baris dengan jarak kolaborasi terbatas

9

Saya memiliki tabel yang dapat dibuat dan diisi dengan kode berikut:

CREATE TABLE dbo.Example(GroupKey int NOT NULL, RecordKey varchar(12) NOT NULL);
ALTER TABLE dbo.Example
    ADD CONSTRAINT iExample PRIMARY KEY CLUSTERED(GroupKey ASC, RecordKey ASC);
INSERT INTO dbo.Example(GroupKey, RecordKey)
VALUES (1, 'Archimedes'), (1, 'Newton'), (1, 'Euler'), (2, 'Euler'), (2, 'Gauss'),
       (3, 'Gauss'), (3, 'Poincaré'), (4, 'Ramanujan'), (5, 'Neumann'),
       (5, 'Grothendieck'), (6, 'Grothendieck'), (6, 'Tao');

Untuk semua baris yang memiliki jarak kolaborasi terbatas berdasarkan RecordKeydengan baris lain, saya ingin menetapkan nilai unik — saya tidak peduli bagaimana atau tipe data apa nilai uniknya.

Rangkaian hasil yang benar yang memenuhi apa yang saya minta dapat dihasilkan dengan kueri berikut:

SELECT 1 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(1, 2, 3)
UNION ALL
SELECT 2 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey = 4
UNION ALL
SELECT 3 AS SupergroupKey, GroupKey, RecordKey
FROM dbo.Example
WHERE GroupKey IN(5, 6)
ORDER BY SupergroupKey ASC, GroupKey ASC, RecordKey ASC;

Untuk lebih membantu apa yang saya minta, saya akan menjelaskan mengapa GroupKey1–3 memiliki hal yang sama SupergroupKey:

GroupKey1 berisi RecordKeyEuler yang pada gilirannya terkandung dalam GroupKey2; jadi GroupKeys 1 dan 2 harus memiliki yang sama SupergroupKey.
Karena Gauss terkandung dalam GroupKeys 2 dan 3, mereka juga harus memiliki yang sama SupergroupKey. Ini mengarah ke GroupKeys 1-3 memiliki hal yang sama SupergroupKey.
Karena GroupKeys 1–3 tidak membagikan apa pun RecordKeydengan yang tersisa GroupKey, mereka adalah satu-satunya yang diberi SupergroupKeynilai 1.

Saya harus menambahkan bahwa solusinya harus generik. Tabel di atas dan set hasil hanyalah sebuah contoh.

Tambahan

Saya menghapus persyaratan agar solusi tidak berulang. Sementara saya lebih suka solusi seperti itu, saya percaya itu adalah kendala yang tidak masuk akal. Sayangnya, saya tidak dapat menggunakan solusi berbasis CLR; tetapi jika Anda ingin memasukkan solusi seperti itu, jangan ragu untuk melakukannya. Saya kemungkinan tidak akan menerimanya sebagai jawaban.

Jumlah baris di meja saya yang sebenarnya adalah sebesar 5 juta, tetapi ada hari-hari ketika jumlah baris "hanya" akan sekitar sepuluh ribu. Rata-rata ada 8 RecordKeydetik per GroupKeydan 4 GroupKeydetik per RecordKey. Saya membayangkan bahwa suatu solusi akan memiliki kompleksitas waktu yang eksponensial, tetapi saya tetap tertarik pada suatu solusi.

sql-server sql-server-2016

— bola basket22
sumber

7

Ini adalah solusi T-SQL berulang untuk perbandingan kinerja.

Itu mengasumsikan bahwa kolom tambahan dapat ditambahkan ke tabel untuk menyimpan kunci grup super, dan pengindeksan dapat diubah:

Mendirikan

DROP TABLE IF EXISTS 
    dbo.Example;

CREATE TABLE dbo.Example
(
    SupergroupKey integer NOT NULL
        DEFAULT 0, 
    GroupKey integer NOT NULL, 
    RecordKey varchar(12) NOT NULL,

    CONSTRAINT iExample 
    PRIMARY KEY CLUSTERED 
        (GroupKey ASC, RecordKey ASC),

    CONSTRAINT [IX dbo.Example RecordKey, GroupKey]
    UNIQUE NONCLUSTERED (RecordKey, GroupKey),

    INDEX [IX dbo.Example SupergroupKey, GroupKey]
        (SupergroupKey ASC, GroupKey ASC)
);

INSERT dbo.Example
    (GroupKey, RecordKey)
VALUES 
    (1, 'Archimedes'), 
    (1, 'Newton'),
    (1, 'Euler'),
    (2, 'Euler'),
    (2, 'Gauss'),
    (3, 'Gauss'),
    (3, 'Poincaré'),
    (4, 'Ramanujan'),
    (5, 'Neumann'),
    (5, 'Grothendieck'),
    (6, 'Grothendieck'),
    (6, 'Tao');

Jika Anda dapat membalik urutan kunci dari kunci utama ini, indeks unik tambahan tidak akan diperlukan.

Garis besar

Pendekatan solusi ini adalah:

Atur id grup super ke 1
Temukan kunci grup yang tidak diproses yang bernomor terendah
Jika tidak ditemukan, keluar
Atur grup super untuk semua baris dengan kunci grup saat ini
Atur grup super untuk semua baris yang terkait dengan baris di grup saat ini
Ulangi langkah 5 hingga tidak ada baris yang diperbarui
Tambahkan id grup super saat ini
Lanjutkan ke langkah 2

Penerapan

Komentar sebaris:

-- No execution plans or rows affected messages
SET NOCOUNT ON;
SET STATISTICS XML OFF;

-- Reset all supergroups
UPDATE E
SET SupergroupKey = 0
FROM dbo.Example AS E
    WITH (TABLOCKX)
WHERE 
    SupergroupKey != 0;

DECLARE 
    @CurrentSupergroup integer = 0,
    @CurrentGroup integer = 0;

WHILE 1 = 1
BEGIN
    -- Next super group
    SET @CurrentSupergroup += 1;

    -- Find the lowest unprocessed group key
    SELECT 
        @CurrentGroup = MIN(E.GroupKey)
    FROM dbo.Example AS E
    WHERE 
        E.SupergroupKey = 0;

    -- Exit when no more unprocessed groups
    IF @CurrentGroup IS NULL BREAK;

    -- Set super group for all records in the current group
    UPDATE E
    SET E.SupergroupKey = @CurrentSupergroup
    FROM dbo.Example AS E 
    WHERE 
        E.GroupKey = @CurrentGroup;

    -- Iteratively find all groups for the super group
    WHILE 1 = 1
    BEGIN
        WITH 
            RecordKeys AS
            (
                SELECT DISTINCT
                    E.RecordKey
                FROM dbo.Example AS E
                WHERE
                    E.SupergroupKey = @CurrentSupergroup
            ),
            GroupKeys AS
            (
                SELECT DISTINCT
                    E.GroupKey
                FROM RecordKeys AS RK
                JOIN dbo.Example AS E
                    WITH (FORCESEEK)
                    ON E.RecordKey = RK.RecordKey
            )
        UPDATE E WITH (TABLOCKX)
        SET SupergroupKey = @CurrentSupergroup
        FROM GroupKeys AS GK
        JOIN dbo.Example AS E
            ON E.GroupKey = GK.GroupKey
        WHERE
            E.SupergroupKey = 0
        OPTION (RECOMPILE, QUERYTRACEON 9481); -- The original CE does better

        -- Break when no more related groups found
        IF @@ROWCOUNT = 0 BREAK;
    END;
END;

SELECT
    E.SupergroupKey,
    E.GroupKey,
    E.RecordKey
FROM dbo.Example AS E;

Rencana eksekusi

Untuk pembaruan utama:

Hasil

Keadaan akhir dari tabel ini adalah:

╔═══════════════╦══════════╦══════════════╗
║ SupergroupKey ║ GroupKey ║  RecordKey   ║
╠═══════════════╬══════════╬══════════════╣
║             1 ║        1 ║ Archimedes   ║
║             1 ║        1 ║ Euler        ║
║             1 ║        1 ║ Newton       ║
║             1 ║        2 ║ Euler        ║
║             1 ║        2 ║ Gauss        ║
║             1 ║        3 ║ Gauss        ║
║             1 ║        3 ║ Poincaré     ║
║             2 ║        4 ║ Ramanujan    ║
║             3 ║        5 ║ Grothendieck ║
║             3 ║        5 ║ Neumann      ║
║             3 ║        6 ║ Grothendieck ║
║             3 ║        6 ║ Tao          ║
╚═══════════════╩══════════╩══════════════╝

Demo: db <> biola

Tes kinerja

Menggunakan set data uji diperluas yang disediakan dalam jawaban Michael Green , timing di laptop saya ^* adalah:

╔═════════════╦════════╗
║ Record Keys ║  Time  ║
╠═════════════╬════════╣
║ 10k         ║ 2s     ║
║ 100k        ║ 12s    ║
║ 1M          ║ 2m 30s ║
╚═════════════╩════════╝

_{* Microsoft SQL Server 2017 (RTM-CU13), Edisi Pengembang (64-bit), Windows 10 Pro, RAM 16GB, SSD, i7 4 core hyperthreaded, nominal 2,4GHz.}

— Paul White 9
sumber

Ini jawaban yang luar biasa. Seperti yang sudah dijelaskan dalam pertanyaan saya, terlalu lambat untuk "hari besar"; tapi ini bagus untuk masa kecilku. Butuh sekitar 5 jam untuk berjalan di meja saya ≈ 2,5 juta baris.

— basketballfan22

10

Masalah ini adalah tentang mengikuti tautan antar item. Ini menempatkannya dalam ranah grafik dan pemrosesan grafik. Secara khusus, seluruh dataset membentuk grafik dan kami sedang mencari komponen dari grafik itu. Ini dapat diilustrasikan oleh sebidang data sampel dari pertanyaan.

Pertanyaannya mengatakan kita bisa mengikuti GroupKey atau RecordKey untuk menemukan baris lain yang berbagi nilai itu. Jadi kita bisa memperlakukan keduanya sebagai simpul dalam grafik. Pertanyaan selanjutnya menjelaskan bagaimana GroupKeys 1–3 memiliki SupergroupKey yang sama. Ini dapat dilihat sebagai gugusan di sebelah kiri yang bergabung dengan garis tipis. Gambar juga menunjukkan dua komponen lain (SupergroupKey) yang dibentuk oleh data asli.

SQL Server memiliki beberapa kemampuan pemrosesan grafik yang dibangun ke T-SQL. Namun, saat ini cukup sedikit, dan tidak membantu dengan masalah ini. SQL Server juga memiliki kemampuan untuk memanggil R dan Python, dan paket yang kaya & kuat tersedia untuk mereka. Salah satunya adalah igraph . Ini ditulis untuk "penanganan cepat grafik besar, dengan jutaan simpul dan tepi ( tautan )."

Menggunakan R dan igraph saya dapat memproses satu juta baris dalam 2 menit 22 detik dalam pengujian lokal ¹ . Inilah perbandingannya dengan solusi terbaik saat ini:

Record Keys     Paul White  R               
------------    ----------  --------
Per question    15ms        ~220ms
100             80ms        ~270ms
1,000           250ms       430ms
10,000          1.4s        1.7s
100,000         14s         14s
1M              2m29        2m22s
1M              n/a         1m40    process only, no display

The first column is the number of distinct RecordKey values. The number of rows
in the table will be 8 x this number.

Saat memproses baris 1M, 1m40 digunakan untuk memuat & memproses grafik, dan memperbarui tabel. Diperlukan 42s untuk mengisi tabel hasil SSMS dengan output.

Pengamatan Task Manager sementara 1M baris diproses menunjukkan sekitar 3GB memori kerja diperlukan. Ini tersedia di sistem ini tanpa paging.

Saya bisa mengkonfirmasi penilaian Ypercube tentang pendekatan CTE rekursif. Dengan beberapa ratus tombol rekam, ia menghabiskan 100% CPU dan semua RAM yang tersedia. Akhirnya tempdb tumbuh lebih dari 80GB dan SPID jatuh.

Saya menggunakan meja Paul dengan kolom SupergroupKey sehingga ada perbandingan yang adil antara solusi.

Untuk beberapa alasan R keberatan dengan aksen pada Poincaré. Mengubahnya menjadi "e" biasa memungkinkannya untuk berjalan. Saya tidak menyelidiki karena itu tidak berhubungan dengan masalah yang dihadapi. Saya yakin ada solusinya.

Ini kodenya

-- This captures the output from R so the base table can be updated.
drop table if exists #Results;

create table #Results
(
    Component   int         not NULL,
    Vertex      varchar(12) not NULL primary key
);


truncate table #Results;    -- facilitates re-execution

declare @Start time = sysdatetimeoffset();  -- for a 'total elapsed' calculation.

insert #Results(Component, Vertex)
exec sp_execute_external_script   
    @language = N'R',
    @input_data_1 = N'select GroupKey, RecordKey from dbo.Example',
    @script = N'
library(igraph)
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)
cpts <- components(df.g, mode = c("weak"))
OutputDataSet <- data.frame(cpts$membership)
OutputDataSet$VertexName <- V(df.g)$name
';

-- Write SuperGroupKey to the base table, as other solutions do
update e
set
    SupergroupKey = r.Component
from dbo.Example as e
inner join #Results as r
    on r.Vertex = e.RecordKey;

-- Return all rows, as other solutions do
select
    e.SupergroupKey,
    e.GroupKey,
    e.RecordKey
from dbo.Example as e;

-- Calculate the elapsed
declare @End time = sysdatetimeoffset();
select Elapse_ms = DATEDIFF(MILLISECOND, @Start, @End);

Inilah yang dilakukan kode R

@input_data_1 adalah cara SQL Server mentransfer data dari tabel ke kode R dan menerjemahkannya ke bingkai data R yang disebut InputDataSet.
library(igraph) mengimpor perpustakaan ke lingkungan eksekusi R.
df.g <- graph.data.frame(d = InputDataSet, directed = FALSE)memuat data ke objek igraph. Ini adalah grafik yang tidak diarahkan karena kita dapat mengikuti tautan dari grup ke record atau record ke group. InputDataSet adalah nama default SQL Server untuk dataset yang dikirim ke R.
cpts <- components(df.g, mode = c("weak")) proses grafik untuk menemukan sub-grafik terpisah (komponen) dan langkah-langkah lainnya.
OutputDataSet <- data.frame(cpts$membership)SQL Server mengharapkan bingkai data kembali dari R. Nama default-nya adalah OutputDataSet. Komponen disimpan dalam vektor yang disebut "keanggotaan". Pernyataan ini menerjemahkan vektor ke bingkai data.
OutputDataSet$VertexName <- V(df.g)$nameV () adalah vektor simpul dalam grafik - daftar GroupKeys dan RecordKeys. Ini menyalinnya ke dalam bingkai data ouput, membuat kolom baru bernama VertexName. Ini adalah kunci yang digunakan untuk mencocokkan dengan tabel sumber untuk memperbarui SupergroupKey.

Saya bukan ahli R. Kemungkinan ini bisa dioptimalkan.

Data Uji

Data OP digunakan untuk validasi. Untuk tes skala saya menggunakan skrip berikut.

drop table if exists Records;
drop table if exists Groups;

create table Groups(GroupKey int NOT NULL primary key);
create table Records(RecordKey varchar(12) NOT NULL primary key);
go

set nocount on;

-- Set @RecordCount to the number of distinct RecordKey values desired.
-- The number of rows in dbo.Example will be 8 * @RecordCount.
declare @RecordCount    int             = 1000000;

-- @Multiplier was determined by experiment.
-- It gives the OP's "8 RecordKeys per GroupKey and 4 GroupKeys per RecordKey"
-- and allows for clashes of the chosen random values.
declare @Multiplier     numeric(4, 2)   = 2.7;

-- The number of groups required to reproduce the OP's distribution.
declare @GroupCount     int             = FLOOR(@RecordCount * @Multiplier);


-- This is a poor man's numbers table.
insert Groups(GroupKey)
select top(@GroupCount)
    ROW_NUMBER() over (order by (select NULL))
from sys.objects as a
cross join sys.objects as b
--cross join sys.objects as c  -- include if needed


declare @c int = 0
while @c < @RecordCount
begin
    -- Can't use a set-based method since RAND() gives the same value for all rows.
    -- There are better ways to do this, but it works well enough.
    -- RecordKeys will be 10 letters, a-z.
    insert Records(RecordKey)
    select
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND())) +
        CHAR(97 + (26*RAND()));

    set @c += 1;
end


-- Process each RecordKey in alphabetical order.
-- For each choose 8 GroupKeys to pair with it.
declare @RecordKey varchar(12) = '';
declare @Groups table (GroupKey int not null);

truncate table dbo.Example;

select top(1) @RecordKey = RecordKey 
from Records 
where RecordKey > @RecordKey 
order by RecordKey;

while @@ROWCOUNT > 0
begin
    print @Recordkey;

    delete @Groups;

    insert @Groups(GroupKey)
    select distinct C
    from
    (
        -- Hard-code * from OP's statistics
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
        union all
        select FLOOR(RAND() * @GroupCount)
    ) as T(C);

    insert dbo.Example(GroupKey, RecordKey)
    select
        GroupKey, @RecordKey
    from @Groups;

    select top(1) @RecordKey = RecordKey 
    from Records 
    where RecordKey > @RecordKey 
    order by RecordKey;
end

-- Rebuild the indexes to have a consistent environment
alter index iExample on dbo.Example rebuild partition = all 
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, 
      ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON);


-- Check what we ended up with:
select COUNT(*) from dbo.Example;  -- Should be @RecordCount * 8
                                   -- Often a little less due to random clashes
select 
    ByGroup = AVG(C)
from
(
    select CONVERT(float, COUNT(1) over(partition by GroupKey)) 
    from dbo.Example
) as T(C);

select
    ByRecord = AVG(C)
from
(
    select CONVERT(float, COUNT(1) over(partition by RecordKey)) 
    from dbo.Example
) as T(C);

Saya baru saja menyadari bahwa saya mendapatkan rasio yang salah dari definisi OP. Saya tidak percaya ini akan mempengaruhi timing. Catatan & Grup simetris dengan proses ini. Untuk algoritme mereka semua hanya simpul dalam grafik.

Dalam pengujian data selalu membentuk komponen tunggal. Saya percaya ini karena distribusi data yang seragam. Jika alih-alih rasio statis 1: 8 yang dikodekan ke dalam rutinitas pembangkitan, saya telah membiarkan rasionya bervariasi, lebih mungkin ada komponen lebih lanjut.

¹ Spesifikasi mesin: Microsoft SQL Server 2017 (RTM-CU12), Edisi Pengembang (64-bit), Windows 10 Home. RAM 16GB, SSD, 4 core hyperthreaded i7, 2,8GHz nominal. Pengujian adalah satu-satunya item yang berjalan pada saat itu, selain aktivitas sistem normal (sekitar 4% CPU).

— Michael Green
sumber

6

Metode CTE rekursif - yang cenderung sangat tidak efisien dalam tabel besar:

WITH rCTE AS 
(
    -- Anchor
    SELECT 
        GroupKey, RecordKey, 
        CAST('|' + CAST(GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS GroupKeys,
        CAST('|' + CAST(RecordKey AS VARCHAR(10)) + '|' AS VARCHAR(100)) AS RecordKeys,
        1 AS lvl
    FROM Example

    UNION ALL

    -- Recursive
    SELECT
        e.GroupKey, e.RecordKey, 
        CASE WHEN r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
            THEN CAST(r.GroupKeys + CAST(e.GroupKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
            ELSE r.GroupKeys
        END,
        CASE WHEN r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(10)) + '|%'
            THEN CAST(r.RecordKeys + CAST(e.RecordKey AS VARCHAR(10)) + '|' AS VARCHAR(100))
            ELSE r.RecordKeys
        END,
        r.lvl + 1
    FROM rCTE AS r
         JOIN Example AS e
         ON  e.RecordKey = r.RecordKey
         AND r.GroupKeys NOT LIKE '%|' + CAST(e.GroupKey AS VARCHAR(10)) + '|%'
         -- 
         OR e.GroupKey = r.GroupKey
         AND r.RecordKeys NOT LIKE '%|' + CAST(e.RecordKey AS VARCHAR(10)) + '|%'
)
SELECT 
    ROW_NUMBER() OVER (ORDER BY GroupKeys) AS SuperGroupKey,
    GroupKeys, RecordKeys
FROM rCTE AS c
WHERE NOT EXISTS
      ( SELECT 1
        FROM rCTE AS m
        WHERE m.lvl > c.lvl
          AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
        OR    m.lvl = c.lvl
          AND ( m.GroupKey > c.GroupKey
             OR m.GroupKey = c.GroupKey
             AND m.RecordKeys > c.RecordKeys
              )
          AND m.GroupKeys LIKE '%|' + CAST(c.GroupKey AS VARCHAR(10)) + '|%'
          AND c.GroupKeys LIKE '%|' + CAST(m.GroupKey AS VARCHAR(10)) + '|%'
      ) 
OPTION (MAXRECURSION 0) ;

Diuji dalam dbfiddle.uk

— ypercubeᵀᴹ
sumber