How to measure "sortedness"


34

I'm wondering if there is a standard way of measuring the "sortedness" of an array? Would an array that has the median number of possible inversions be considered maximally unsorted? By that I mean it is basically as far as possible from being either sorted or reverse sorted.

Answers:


31

No, it depends on your application. Measures of sortedness are often referred to as measures of disorder, which are functions from $\mathbb{N}^{<\mathbb{N}}$ to $\mathbb{R}$, where $\mathbb{N}^{<\mathbb{N}}$ is the collection of all finite sequences of distinct nonnegative integers. The survey by Estivill-Castro and Wood [1] lists and discusses 11 different measures of disorder in the context of adaptive sorting algorithms.

The number of inversions might work for some cases, but is sometimes insufficient. An example given in [1] is the sequence

$$n/2+1,\ n/2+2,\ \ldots,\ n,\ 1,\ \ldots,\ n/2$$

that has a quadratic number of inversions, but only consists of two ascending runs. It is nearly sorted, but this is not captured by inversions.
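To make the gap between the two measures concrete, here is a minimal C++ sketch that counts inversions and ascending runs for the sequence above. The function names are hypothetical (they do not come from the survey), the inversion count is deliberately the naive quadratic one, and `twoRunSequence` assumes an even n.

```cpp
#include <cstddef>
#include <vector>

// Counts pairs (i, j) with i < j and a[i] > a[j]; quadratic time, kept simple on purpose.
long long countInversions(const std::vector<int>& a) {
    long long inversions = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = i + 1; j < a.size(); ++j)
            if (a[i] > a[j]) ++inversions;
    return inversions;
}

// Counts maximal ascending runs: one run to start with, plus one per descent.
std::size_t countAscendingRuns(const std::vector<int>& a) {
    if (a.empty()) return 0;
    std::size_t runs = 1;
    for (std::size_t i = 1; i < a.size(); ++i)
        if (a[i - 1] > a[i]) ++runs;
    return runs;
}

// Builds n/2+1, ..., n, 1, ..., n/2 (illustration only; n is assumed to be even).
std::vector<int> twoRunSequence(int n) {
    std::vector<int> a;
    for (int v = n / 2 + 1; v <= n; ++v) a.push_back(v);
    for (int v = 1; v <= n / 2; ++v) a.push_back(v);
    return a;
}
```

For n = 8 the sequence is 5, 6, 7, 8, 1, 2, 3, 4: every element of the first half is inverted with every element of the second half, giving (n/2)^2 = 16 inversions, while the run count stays at 2 for any even n.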


[1] Estivill-Castro, Vladimir, and Derick Wood. "A survey of adaptive sorting algorithms." ACM Computing Surveys (CSUR) 24.4 (1992): 441-476.


2
The context is trying to understand why quicksort performs relatively poorly on random permutations of n elements where the number of inversions is close to the median.
Robert S. Barnes

1
Great example, that's exactly the info I was looking for.
Robert S. Barnes

1
Estivill-Castro and Wood is THE reference for this for sure.
Pedro Dusso

10

Mannila [1] axiomatizes presortedness (with a focus on comparison-based algorithms) as follows (paraphrasing).

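Let $\Sigma$ be a totally ordered set. Then a mapping $m$ from $\Sigma^\star$ (the finite sequences of distinct elements from $\Sigma$) to the naturals is a measure of presortedness if it satisfies the conditions below.

  1. If $X \in \Sigma^\star$ is sorted then $m(X) = 0$.

  2. If $X, Y \in \Sigma^\star$ with $X = x_1 \dots x_n$, $Y = y_1 \dots y_n$ and $x_i < x_j \iff y_i < y_j$ for all $i, j \in [1..n]$, then $m(X) = m(Y)$.

  3. If $X$ is a subsequence of $Y \in \Sigma^\star$, then $m(X) \leq m(Y)$.

  4. If $x_i < y_j$ for all $i \in [1..|X|]$ and $j \in [1..|Y|]$ for some $X, Y \in \Sigma^\star$, then $m(X \cdot Y) \leq m(X) + m(Y)$.

  5. $m(a \cdot X) \leq |X| + m(X)$ for all $X \in \Sigma^\star$ and $a \in \Sigma \setminus X$.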

Examples of such measures are the

  • number of inversions,
  • number of swaps,
  • the number of elements that are not left-to-right maxima, and
  • the length of a longest increasing subsequence (subtracted from the input length).
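As a rough illustration, three of the listed measures can be sketched in C++ as follows (the inversion count is left out for brevity). All function names are hypothetical, the input is assumed to contain distinct elements, and "number of swaps" is read here as the minimum number of exchanges needed to sort, computed from the cycle structure of the permutation.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Minimum number of exchanges needed to sort: n minus the number of cycles
// of the permutation that sends each rank to the index currently holding it.
std::size_t minSwaps(const std::vector<int>& a) {
    const std::size_t n = a.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&a](std::size_t i, std::size_t j) { return a[i] < a[j]; });
    std::vector<bool> seen(n, false);
    std::size_t cycles = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (seen[i]) continue;
        ++cycles;
        for (std::size_t j = i; !seen[j]; j = order[j]) seen[j] = true;
    }
    return n - cycles;
}

// Number of elements that are not left-to-right maxima.
std::size_t nonLeftToRightMaxima(const std::vector<int>& a) {
    std::size_t maxima = 0;
    int best = 0;  // only read after the first element has set it
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (i == 0 || a[i] > best) {
            best = a[i];
            ++maxima;
        }
    }
    return a.size() - maxima;
}

// Input length minus the length of a longest increasing subsequence,
// using the standard O(n log n) "tails" technique.
std::size_t lengthMinusLis(const std::vector<int>& a) {
    std::vector<int> tails;  // tails[k] = smallest tail of an increasing subsequence of length k + 1
    for (int x : a) {
        auto it = std::lower_bound(tails.begin(), tails.end(), x);
        if (it == tails.end()) tails.push_back(x);
        else *it = x;
    }
    return a.size() - tails.size();
}
```

On the sequence 5, 1, 2, 3, 4 these give 4 swaps, 4 elements that are not left-to-right maxima, and 1 element outside a longest increasing subsequence, which shows how differently the measures can rate the same input. All three return 0 on a sorted sequence, as condition 1 requires.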

Note that random distributions based on these measures have been defined, i.e. distributions that make sequences more or less likely depending on how sorted they are. These are called Ewens-like distributions [2, Ch. 4-5; 3, Example 12; 4], a special case of which is the so-called Mallows distribution. The weights are parametric in a constant $\theta > 0$ and fulfill

$$\Pr(X) = \frac{\theta^{m(X)}}{\sum_{Y \in \Sigma^\star \cap \Sigma^{|X|}} \theta^{m(Y)}}.$$

Note how θ=1 defines the uniform distribution (for all m).
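For small inputs the weights can be checked by brute force. The sketch below is only an illustration under stated assumptions: it takes m to be the number of inversions, restricts the sequences to permutations of 1..n, and enumerates all n! of them, so it is not one of the efficient samplers mentioned in the cited work. The name `ewensLikeWeights` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// m(X): number of inversions (an assumption; any measure of presortedness could be plugged in).
long long inversions(const std::vector<int>& a) {
    long long count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = i + 1; j < a.size(); ++j)
            if (a[i] > a[j]) ++count;
    return count;
}

// Returns Pr(X) = theta^{m(X)} / sum_Y theta^{m(Y)} for every permutation X of 1..n.
// Exponential in n; intended for sanity checks on tiny n only.
std::map<std::vector<int>, double> ewensLikeWeights(int n, double theta) {
    assert(n >= 0 && theta > 0.0);
    std::vector<int> perm(static_cast<std::size_t>(n));
    std::iota(perm.begin(), perm.end(), 1);  // start from the sorted permutation
    std::map<std::vector<int>, double> pr;
    double total = 0.0;
    do {
        const double weight = std::pow(theta, static_cast<double>(inversions(perm)));
        pr[perm] = weight;
        total += weight;
    } while (std::next_permutation(perm.begin(), perm.end()));
    for (auto& entry : pr) entry.second /= total;  // normalise so the weights sum to 1
    return pr;
}
```

With n = 3 and θ = 1 every permutation gets probability 1/6. With θ = 0.5 the normalising sum is 2.625, so the sorted permutation gets 1/2.625 and the reversed one 0.125/2.625; values of θ below 1 therefore favour sequences that are closer to sorted.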

Since it is possible to sample permutations w.r.t. these measures efficiently, this body of work can be useful in practice when benchmarking sorting algorithms.


  1. Measures of Presortedness and Optimal Sorting Algorithms by H. Mannila (1985)
  2. Logarithmic combinatorial structures: a probabilistic approach by R. Arratia, A.D. Barbour and S. Tavaré (2003)
  3. On adding a list of numbers (and other one-dependent determinantal processes) by A. Borodin, P. Diaconis and J. Fulman (2010)
  4. Ewens-like distributions and Analysis of Algorithms by N. Auger et al. (2016)

3

I have my own definition of "sortedness" of a sequence.

Given any sequence [a,b,c,…] we compare it with the sorted sequence containing the same elements, count number of matches and divide it by the number of elements in the sequence.

For example, given sequence [5,1,2,3,4] we proceed as follows:

1) sort the sequence: [1,2,3,4,5]

2) compare the sorted sequence with the original by moving it one position at a time and counting the maximal number of matches:

        [5,1,2,3,4]
[1,2,3,4,5]                            one match

        [5,1,2,3,4]
  [1,2,3,4,5]                          no matches

        [5,1,2,3,4]
    [1,2,3,4,5]                        no matches

        [5,1,2,3,4]
      [1,2,3,4,5]                      no matches

        [5,1,2,3,4]
        [1,2,3,4,5]                    no matches

        [5,1,2,3,4]
          [1,2,3,4,5]                  4 matches

        [5,1,2,3,4]
            [1,2,3,4,5]                no matches

                ...

        [5,1,2,3,4]
                [1,2,3,4,5]            no matches

3) The maximal number of matches is 4, we can calculate the "sortedness" as 4/5 = 0.8.

Sortedness of a sorted sequence would be 1, and sortedness of a sequence with elements placed in reversed order would be 1/n.
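The sliding comparison described above could be sketched in C++ roughly as follows. The function name `slidingSortedness` is hypothetical, and treating an empty sequence as fully sorted is an assumption the definition itself does not cover.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Slides the sorted copy across the original sequence, counts position-wise
// matches for every relative shift, and returns the best count divided by n.
double slidingSortedness(const std::vector<int>& a) {
    const long n = static_cast<long>(a.size());
    if (n == 0) return 1.0;  // assumption: an empty sequence counts as sorted
    std::vector<int> sorted(a);
    std::sort(sorted.begin(), sorted.end());
    long best = 0;
    for (long shift = -(n - 1); shift <= n - 1; ++shift) {
        long matches = 0;
        for (long i = 0; i < n; ++i) {
            const long j = i + shift;  // index in the sorted copy lined up with a[i]
            if (j >= 0 && j < n &&
                a[static_cast<std::size_t>(i)] == sorted[static_cast<std::size_t>(j)]) {
                ++matches;
            }
        }
        best = std::max(best, matches);
    }
    return static_cast<double>(best) / static_cast<double>(n);
}
```

For 5, 1, 2, 3, 4 the best alignment is the one that shifts the sorted copy one position to the right, giving 4 matches and a value of 0.8. The double loop makes this quadratic in the length of the sequence.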

The idea behind this definition is to estimate the minimal amount of work needed to convert any sequence into the sorted sequence. In the example above we need to move just one element, the 5 (there are many ways, but moving the 5 is the most efficient). If the elements were placed in reversed order, we would need to move 4 elements. And if the sequence were already sorted, no work would be needed.

I hope my definition makes sense.


Nice idea. A similar definition is Exc, the third definition of disorder in the paper mentioned in Juho's answer. Exc is the number of operations required to rearrange a sequence into sorted order.
Apass.Jack

Well, maybe. I just applied my understanding of entropy and disorder to the sequence of elements :-)
Andrushenko Alexander

-2

If you need something quick and dirty (summation signs scare me) I wrote a super easy disorder function in C++ for a Class named Array which generates int arrays filled with randomly generated numbers:

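#include <cstdlib>   // std::abs
#include <iostream>  // std::cout

// Prints the mean absolute displacement, scaled so that a reversed array of even length scores 1.
// Meaningful only if `array` holds each of the values 1..arraySize exactly once.
void Array::disorder() {
    double disorderValue = 0;
    for (int n = 0; n < this->arraySize; n++) {
        // Distance between the value a sorted array would hold here (n + 1) and the actual value.
        disorderValue += std::abs((n + 1) - array[n]);
    }
    // Divide by 2.0 rather than 2 to avoid integer division (and a zero divisor for a one-element array).
    std::cout << "Disorder Value = " << (disorderValue / this->arraySize) / (this->arraySize / 2.0) << "\n" << std::endl;
}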

The function simply compares the value in each element to the index of the element + 1, so that an array in reverse order has a disorder value of 1 and a sorted array has a disorder value of 0 (this presumes the array holds the values 1 to n, each exactly once). Not sophisticated, but it works.

Michael


This is not a programming site. It would have sufficed to define the disorder notion, and to mention that it can be computed in linear time.
Yuval Filmus
Licensed under cc by-sa 3.0 with attribution required.