TAstic: Simple similarity detection¶
Release: | 0.2.0 |
---|---|
Date: | August 21, 2016 |
Motivation¶
To the author’s knowledge, there are no free/libre plagiarism detection systems
currently in use. TAstic
provides a very simple and reasonably effective
version of such a system (currently less than 300 lines of source code), under
GPL and designed to be easy to extend and modify. It exists both as a Python
library and a command-line tool, has no external dependencies other than a
Python interpreter, and may easily be incorporated into other software.
Command-line usage¶
The simplest way to invoke TAstic is by simply providing it a list of files to compare for similarity:
tastic file1 file2 ...
The provided files should be plain-text; for details of the comparison
performed, see below. If any pair of the provided files is found to have a
similarity above a certain threshold, a report is printed to stdout. The
threshold is adjustable via the -t
flag; for example, to provide a report
for each pair, run:
tastic -t 0 file1 file2 ...
By default the threshold is 25%. Again, for interpretation of the threshold see below.
tastic
can also search a list of provided directories for a file with a
given name using the -a
option:
tastic -a filename dir1 dir2 ...
The above command will perform the same operation as
tastic dir1/filename dir2/filename ...
The argument to -a
can be any Python-style regex; for example
tastic -a '(file)+name' dir1 dir2
will match dir1/filefilename
and dir2/filename
. Only the first match
found in each directory will be used. If there are no matches in a particular
directory, a warning will be issued over stderr, and that directory will be
ignored.
Comparison algorithm¶
Similarity metric¶
The similarity between two documents is reported as a number between 0 and 1. That number is derived as if by the following algorithm:
- Split each document into sequences of three words. For example, “The quick brown fox” would become “The quick brown”, “quick brown fox”.
- Count the number of unique 3-sequences the documents have in common.
- Divide that number by the total number of unique 3-sequences in both documents.
Determining the metric¶
A naive translation of the above algorithm would require \(O(k^²n)\) time for \(k\) documents of length \(n\); it would additionally require \(O(kn)\) space. This is unacceptably high for large \(k\) and \(n\); instead, we use the MinHash algorithm to get a good approximation of the metric in \(O(k^2+n)\) time and \(O(k)\) space.