Hash Join vs Hash Semi Join

PostgreSQL 9.2

I'm trying to understand the difference between Hash Semi Join and a plain Hash Join.

Here are two queries:

EXPLAIN ANALYZE SELECT * FROM orders WHERE customerid IN (SELECT
customerid FROM customers WHERE state='MD');

Hash Semi Join  (cost=740.34..994.61 rows=249 width=30) (actual time=2.684..4.520 rows=120 loops=1)
  Hash Cond: (orders.customerid = customers.customerid)
  ->  Seq Scan on orders  (cost=0.00..220.00 rows=12000 width=30) (actual time=0.004..0.743 rows=12000 loops=1)
  ->  Hash  (cost=738.00..738.00 rows=187 width=4) (actual time=2.664..2.664 rows=187 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 7kB
        ->  Seq Scan on customers  (cost=0.00..738.00 rows=187 width=4) (actual time=0.018..2.638 rows=187 loops=1)
              Filter: ((state)::text = 'MD'::text)
              Rows Removed by Filter: 19813

EXPLAIN ANALYZE SELECT * FROM orders o JOIN customers c ON o.customerid = c.customerid WHERE c.state = 'MD'

Hash Join  (cost=740.34..1006.46 rows=112 width=298) (actual time=2.831..4.762 rows=120 loops=1)
  Hash Cond: (o.customerid = c.customerid)
  ->  Seq Scan on orders o  (cost=0.00..220.00 rows=12000 width=30) (actual time=0.004..0.768 rows=12000 loops=1)
  ->  Hash  (cost=738.00..738.00 rows=187 width=268) (actual time=2.807..2.807 rows=187 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 37kB
        ->  Seq Scan on customers c  (cost=0.00..738.00 rows=187 width=268) (actual time=0.018..2.777 rows=187 loops=1)
              Filter: ((state)::text = 'MD'::text)
              Rows Removed by Filter: 19813

As can be seen, the only differences between the plans are that the hash table consumes 7kB in the first case but 37kB in the second, and that the join node in the first is a Hash Semi Join.

But I don't understand the difference in hash table size. The Hash node is fed by exactly the same Seq Scan node with the same Filter. Why is there a difference?

postgresql join hashing

— St.Antario

Have you looked at the actual output of the queries? Or, use explain (analyze, verbose).
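For reference, the VERBOSE option adds an Output line to each plan node, listing the columns that node emits. Comparing the Output of the two Hash nodes should show what goes into each hash table. A sketch using the two queries from the question, unchanged apart from the EXPLAIN options:

```sql
-- VERBOSE adds an "Output:" line to every plan node,
-- listing the columns that node passes up to its parent.
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM orders
WHERE customerid IN (SELECT customerid FROM customers WHERE state = 'MD');

EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM orders o
JOIN customers c ON o.customerid = c.customerid
WHERE c.state = 'MD';
```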

— jjanes

In the first query, only customerid needs to be stored from customers into the hash table, because that is the only data needed to implement the semi-join.

In the second query, all of the columns need to be stored into the hash table, because you are selecting all of the columns from the table (using *) rather than just testing for the existence of the customerid. The Hash nodes in the two plans reflect this: width=4 in the first plan versus width=268 in the second.
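One way to check this explanation, as a sketch rather than a guaranteed result: keep the plain join but select only the columns of orders. Assuming the planner then needs nothing from customers except the join key, the Hash node should report a width and memory usage close to those of the semi-join plan.

```sql
-- Plain join, but only columns of orders are selected, so the hash table
-- built on customers should need nothing beyond customerid.
EXPLAIN ANALYZE
SELECT o.*
FROM orders o
JOIN customers c ON o.customerid = c.customerid
WHERE c.state = 'MD';
```

This is not a drop-in replacement for the IN form: a plain join returns an orders row once per matching customers row, so the two queries give the same rows only if customerid is unique in customers.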

— jjanes