First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
with 1787553 stored elements in Compressed Sparse Row format>
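As a quick check of the row-wise normalization mentioned above, here is a minimal sketch on a tiny made-up corpus (the `toy_docs` list and the variable names are illustrative assumptions, not part of the 20 newsgroups session). It relies on TfidfVectorizer's default `norm='l2'` setting, so every non-empty row should have Euclidean length 1:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "the car is red",
    "the car is fast",
    "engines from ford",
]

# With default settings, TfidfVectorizer applies L2 normalization per row.
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Euclidean norm of each sparse row: sqrt of the sum of squared entries.
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)).ravel())
print(row_norms)
```

If you pass `norm=None` to the vectorizer, the rows are no longer unit length and the dot-product shortcut used below stops being a cosine similarity.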
Now, to find the cosine similarities between one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, since the tfidf vectors are already row-normalized.
As explained by Chris Clark in the comments and here, cosine similarity does not take the magnitude of the vectors into account. Row-normalized vectors have a magnitude of 1, so the linear kernel is sufficient to calculate the similarity values.
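To make that argument concrete, the following sketch compares the textbook cosine formula with a plain dot product on unit-length rows. The random matrix `X` is an arbitrary stand-in for the tfidf matrix and all names are assumptions made for this example:

```python
import numpy as np

# Arbitrary dense data standing in for document vectors (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((4, 6))
query = X[0]

# Cosine similarity from its definition: dot product divided by both norms.
cosine = (X @ query) / (np.linalg.norm(X, axis=1) * np.linalg.norm(query))

# Normalize every row to unit length, then take a plain dot product.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
dot_on_unit = X_unit @ X_unit[0]

print(np.allclose(cosine, dot_on_unit))
```

Once the rows have length 1, the denominator in the cosine formula is 1 for every pair, which is why the dot product alone gives the same numbers.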
The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector, you need to slice the matrix row-wise to get a submatrix with a single row:
>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
with 89 stored elements in Compressed Sparse Row format>
scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, which is also known as the linear kernel:
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
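If you want to convince yourself that the linear kernel really returns cosine similarities here, one option is to compare it with scikit-learn's `cosine_similarity` on a small corpus. This is only a sketch: the `toy_docs` strings are invented for the example, and the comparison assumes the vectorizer's default L2 normalization:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus: a query, a reply quoting it, and two unrelated texts.
toy_docs = [
    "what car is this",
    "re: what car is this, it is a bricklin",
    "bricklin engines came from ford",
    "solar energy project",
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Similarities of the first document against the whole collection.
via_linear = linear_kernel(toy_tfidf[0:1], toy_tfidf).flatten()
via_cosine = cosine_similarity(toy_tfidf[0:1], toy_tfidf).flatten()

print(via_linear)
print(np.allclose(via_linear, via_cosine))
```

`cosine_similarity` normalizes its inputs again before taking the dot product, so on already normalized rows `linear_kernel` does the same job with less work.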
Hence, to find the top related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values, so they sit at the end of the sorted indices array). Note that the slice [:-5:-1] returns the top four entries, including the query document itself; use [:-6:-1] to get five:
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])
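The negative slicing is easy to get off by one, so here is a small numpy-only sketch of how it behaves. The `scores` array is made up for illustration, and `top_k_excluding_query` is a hypothetical helper, not something scikit-learn or numpy provides:

```python
import numpy as np

# Made-up similarity scores; index 0 plays the role of the query document.
scores = np.array([1.0, 0.04, 0.11, 0.55, 0.33, 0.28, 0.02])

order = scores.argsort()  # indices sorted by ascending score

top_four = order[:-5:-1]  # walks back from the end, stops before position -5
top_five = order[:-6:-1]  # one more step back yields five indices


def top_k_excluding_query(scores, query_index, k):
    """Hypothetical helper: k best matches, skipping the query itself."""
    ranked = scores.argsort()[::-1]  # descending order of similarity
    return ranked[ranked != query_index][:k]


print(top_four)                              # [0 3 4 5]
print(top_five)                              # [0 3 4 5 2]
print(top_k_excluding_query(scores, 0, 3))   # [3 4 5]
```

Dropping the query index is usually what you want in a "related documents" feature, since the query always matches itself with a score of 1.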
The first result is a sanity check: we find the query document as the most similar document, with a cosine similarity score of 1. It has the following text:
>>> print(twenty.data[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
The second most similar document is a reply that quotes the original message and hence has many words in common:
>>> print(twenty.data[958])
From: rseymour@reed.edu (Robert Seymour)
Subject: Re: WHAT car is this!?
Article-I.D.: reed.1993Apr21.032905.29286
Reply-To: rseymour@reed.edu
Organization: Reed College, Portland, OR
Lines: 26
In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my
thing) writes:
>
> I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In
addition,
> the front bumper was separate from the rest of the body. This is
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.
Bricklins were manufactured in the 70s with engines from Ford. They are rather
odd looking with the encased front bumper. There aren't a lot of them around,
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
performance Ford with new styling slapped on top.
> ---- brought to you by your neighborhood Lerxst ----
Rush fan?
--
Robert Seymour rseymour@reed.edu
Physics and Philosophy, Reed College (NeXTmail accepted)
Artificial Life Project Reed College
Reed Solar Energy Project (SolTrain) Portland, OR