NAME
Code::DRY - Cut-and-Paste-Detector for Perl code
SYNOPSIS
    use Code::DRY;

    # high level usage: scan some directories for Perl code
    # and let the module report duplicates sorted
    # by length of duplicates. Minimum length is 4 lines,
    # and all filters are set to undef.
    Code::DRY::scan_directories(4, undef, undef, undef, @dirs);

or

    # mid level usage: let the function report duplicates
    # from a list of files. The ignore filter is set to undef.
    # This time the minimum length is set to 40 bytes.
    Code::DRY::find_duplicates_in(-40, undef, @files);

or

    # low level usage: analyse the raw data yourself

    # build the suffix and lcp arrays
    Code::DRY::build_suffixarray_and_lcp($longstringwithcode);

    # avoid matches crossing file boundaries
    Code::DRY::clip_lcp_to_fileboundaries(\@Code::DRY::fileoffsets);

    # avoid matches overlapping into each other
    Code::DRY::reduce_lcp_to_nonoverlapping_lengths();

    # avoid matches that are included in longer matches
    Code::DRY::set_lcp_to_zero_for_shadowed_substrings();

    # then iterate through the lcp array via get_len_at(index)
    # and through the suffix/offset array via get_offset_at(index)
DESCRIPTION
The module's main purpose is to report repeated text fragments (typically Perl code) that could be considered for isolation and/or abstraction in order to reduce multiple copies of the same code (aka cut and paste code).
Code duplicates may occur in the same line, file or directory.
The ad hoc approach of comparing every item against every other item leads to computing times growing quadratically with the amount of code, which is not useful for anything but the smallest code bases.
So an efficient data structure is needed.
This module can create the suffix array and the longest common prefix array for a string of 8-bit characters. These data structures can be used to search for repetitions of substrings in O(n) time.
The current strategy is to concatenate code from all files into one string and then use the suffix array and its companion, the longest-common-prefix (lcp) array on this string.
Example:
Instead of real Perl code I use the string 'mississippi' for simplicity. A suffix is a partial string of an input string that extends to the end of the input string. A prefix is a partial string of an input string that starts at the beginning of the input string. The suffix array of a string is a list of offsets (one for each suffix), sorted lexicographically by suffix:
     #  offset  suffix
    ====================
     0     10:  i
     1      7:  ippi
     2      4:  issippi
     3      1:  ississippi
     4      0:  mississippi
     5      9:  pi
     6      8:  ppi
     7      6:  sippi
     8      3:  sissippi
     9      5:  ssippi
    10      2:  ssissippi
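The suffix array above can be reproduced with a short Perl sketch (for illustration only; it sorts all suffixes naively, whereas the module's XS code uses the linear-time sais-lite algorithm):

    use strict;
    use warnings;

    my $s = 'mississippi';
    # naive construction: sort all suffix start offsets lexicographically
    my @sa = sort { substr($s, $a) cmp substr($s, $b) } 0 .. length($s) - 1;
    printf "%2d %6d:  %s\n", $_, $sa[$_], substr($s, $sa[$_]) for 0 .. $#sa;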
The other structure needed is the longest common prefix array (lcp). For each entry it contains the length of the longest prefix that this entry's suffix shares with the suffix of the previous entry in the suffix array. For this example it looks like this:
     #  offset  lcp (common prefixes shown in ())
    =============================================
     0     10:  0 ()
     1      7:  1 (i)
     2      4:  1 (i)
     3      1:  4 (issi)  overlap!
     3          3 (iss)   corrected non overlapping prefixes
     4      0:  0 ()
     5      9:  0 ()
     6      8:  1 (p)
     7      6:  0 ()
     8      3:  2 (si)
     9      5:  1 (s)
    10      2:  3 (ssi)
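Continuing the sketch above (which defined $s and @sa), the uncorrected lcp values can be computed naively by comparing each suffix with its predecessor character by character:

    # lcp[$i] = length of the common prefix of the suffixes
    # at $sa[$i] and $sa[$i - 1]; lcp[0] is 0 by definition
    my @lcp = (0);
    for my $i (1 .. $#sa) {
        my $max = length($s) - ($sa[$i] > $sa[$i - 1] ? $sa[$i] : $sa[$i - 1]);
        my $len = 0;
        $len++ while $len < $max
                  && substr($s, $sa[$i] + $len, 1) eq substr($s, $sa[$i - 1] + $len, 1);
        push @lcp, $len;
    }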
The standard lcp array may contain overlapping prefixes, but for our purposes only non-overlapping prefix lengths are useful. A similar overlap can occur for prefixes that extend from the end of one source file into the start of the next file, because the concatenated content of all source files is analysed as one string. Internal overlaps and file-crossing prefixes are limited afterwards by the functions "reduce_lcp_to_nonoverlapping_lengths" and "clip_lcp_to_fileboundaries", respectively.
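In terms of the sketch above, the internal-overlap correction amounts to capping each lcp value at the distance between the two neighbouring suffix offsets, which turns the overlapping 'issi' into 'iss':

    # a prefix must not run into its own second copy, so its usable
    # length is limited by the distance between the two offsets
    for my $i (1 .. $#lcp) {
        my $dist = abs($sa[$i] - $sa[$i - 1]);
        $lcp[$i] = $dist if $lcp[$i] > $dist;
    }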
If we sort the lcp values obtained in this way in descending order, we get:
     #  offset  lcp (prefix shown in ())
    ====================================
     3      1:  3 (iss)  now corrected to non overlapping prefixes
    10      2:  3 (ssi)
     8      3:  2 (si)
     1      7:  1 (i)
     2      4:  1 (i)
     6      8:  1 (p)
     9      5:  1 (s)
     0     10:  0 ()
     4      0:  0 ()
     5      9:  0 ()
     7      6:  0 ()
The first entry shows the longest repetition in the given string. Not all entries are of interest, since smaller copies are contained in the longest match. After removing all 'shadowed' repetitions, the next entry can be reported. Eventually the remaining lcp values become too small to be of any interest.
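The ranking shown above can be reproduced from the corrected arrays of the sketch:

    # rank the entries by their corrected lcp value, longest first
    for my $i (sort { $lcp[$b] <=> $lcp[$a] } 0 .. $#lcp) {
        printf "%2d %6d:  %d (%s)\n",
            $i, $sa[$i], $lcp[$i], substr($s, $sa[$i], $lcp[$i]);
    }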
Currently this is experimental code.
The most appropriate mailing list on which to discuss this module would be perl-qa. See <http://lists.perl.org/list/perl-qa.html>.
REQUIREMENTS & OPTIONAL MODULES
REQUIREMENTS
- The ability to compile XS extensions.
This means a working C compiler, a linker, a "make" program and so on. If you built perl from source, you will have these already and they will be used automatically. If your perl was built in some other way, for example installed through your operating system's packaging mechanism, you will need to ensure that the appropriate tools are installed.
- Module "File::Find"
This module has been part of the Perl core for a long time now.
OPTIONAL MODULES
- Test::More
Required if you want to run Code::DRY's own tests.
- Test::Output
Optionally used when running Code::DRY's own tests.
SUBROUTINES
"scan_directories($minlength, $ignoreContent, $regexAccept, $regexIgnore, @array_of_pathnames_of_directories)"¶
Scans the directories given in @array_of_pathnames_of_directories recursively for file names matching the regexp $regexAccept, if it is defined. File names that additionally do not match the regexp $regexIgnore (unless $regexIgnore is undefined) are included in the analysis.
If $regexAccept and $regexIgnore both are "undef", all file names will be considered for analysis.
If either $regexAccept or $regexIgnore is not a ref of type 'Regexp', it is expected to be a pattern string that will be converted into a regexp with "qr{}xms".
The parameter $ignoreContent can be used to avoid duplication reports for content matching this regex. If $ignoreContent is not a ref of type 'Regexp', it is expected to be a pattern string that will be converted into a regexp with "qr{}xms".
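The conversion described for these pattern parameters amounts to something like this sketch ($pattern standing for any of them):

    my $regex = ref $pattern eq 'Regexp' ? $pattern : qr{$pattern}xms;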
The parameter $minlength is interpreted in units of lines when it is positive. Otherwise its absolute value is interpreted in units of bytes.
All repetitions with a minimum length of $minlength will be reported by the "report" callback function.
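For example, a call along these lines (the directory name and file name patterns are illustrative) reports all duplicates of at least four lines found in .pl and .pm files below lib, skipping names that contain '.svn':

    use Code::DRY;

    Code::DRY::scan_directories(4, undef, qr{[.]p[lm]\z}xms, qr{[.]svn}xms, 'lib');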
"find_duplicates_in($minlength, $ignoreContent, @array_of_pathnames_of_files)"¶
Reads the files given by the file names and composes one long string, which is then analysed for repetitions.
The parameter $minlength is interpreted in units of lines when it is positive. Otherwise its absolute value is interpreted in units of bytes.
The parameter $ignoreContent can be used to avoid duplication reports for content matching this regex. If $ignoreContent is not a ref of type 'Regexp', it is expected to be a pattern string that will be converted into a regexp with "qr{}xms".
All repetitions with a minimum length of $minlength will be reported by the "report" callback function.
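A usage sketch with illustrative file names, reporting duplicates of at least 40 bytes and ignoring content that consists of '=' ruler lines:

    use Code::DRY;

    Code::DRY::find_duplicates_in(-40, qr{^=+$}xms, 'lib/Foo.pm', 'lib/Bar.pm');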
"set_reporter(sub{ CODE BLOCK })"¶
Set custom code to report duplicates of a code fragment. The callback is invoked with position information for the copies found during analysis.
The supplied code has to accept two scalars and an array reference.
The first parameter is the required minimum length of duplicates to be reported.
The second parameter contains a string describing the units for minimum length ('lines' or 'bytes').
The referenced array (third parameter) contains one anonymous array reference for each copy found. Copies are reported in the order of the processing of the files and then in the order of their positions. Each copy is represented by the following position information as array entries (see the sketch after this list):
- 1. filename
- 2. line number at start of copy (starting with 1). This is the line number of the first line completely contained in the copy.
- 3. line number at end of copy. This is the line number of the last line completely contained in the copy.
- 4. offset from start of file at start of copy (starting with 0) clipped to the next completely contained line.
- 5. offset from start of file at end of copy clipped to the last completely contained line.
- 6. offset from start of file at start of copy (starting with 0) (used in 'bytes' mode)
- 7. offset from start of file at end of copy (used in 'bytes' mode)
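A minimal custom reporter along these lines only prints file names and line ranges (it uses just the first three entries described above):

    use Code::DRY;

    Code::DRY::set_reporter(sub {
        my ($minlength, $unit, $copies) = @_;
        print scalar @{$copies}, " copies (minimum length $minlength $unit):\n";
        for my $copy (@{$copies}) {
            my ($name, $firstline, $lastline) = @{$copy}[0 .. 2];
            print "  $name, lines $firstline..$lastline\n";
        }
    });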
The default reporter is like this:
XXX insert code when stable
"set_default_reporter"¶
Resets the reporter callback function to the default shown above.
"report_dupes($minlength, $copies, $length, $index_in_suffix_and_lcp_array)"¶
This function builds a data structure with position information for the duplicate copies described by the input parameters. The entries in the suffix array from $index_in_suffix_and_lcp_array to "$index_in_suffix_and_lcp_array + $copies - 1" give the offsets in the string where the copies start. Each copy has a length of $length characters. From these values file names and line numbers are retrieved and stored in the structure. Then the reporter callback function is called with the minimum length of this scan $minlength, the unit string, and this structure. See also function set_reporter().
enter_files($ref_to_array_of_pathnames_of_files)
Reads the files given by the pathnames. Any files with length zero are skipped (and removed from the filename array). Offset arrays for file and line end positions are built. The content of all files is concatenated. Currently the content does not have to be valid Perl (but this might change when parsing gets involved in a future release).
build_suffixarray_and_lcp($textstring_to_analyse)
The XS routine calls the sais-lite function to construct the suffix and longest common prefix arrays for the complete string to analyse.
clip_lcp_to_fileboundaries(@array_with_endoffset_positions_of_files)
The XS routine limits lcp values that cross file boundaries, according to the given file end positions. The accumulated end positions have to be sorted in ascending order. Internally a binary search is done to find the right file end position. @Code::DRY::fileoffsets contains the needed offsets when "enter_files" has been used before.
"reduce_lcp_to_nonoverlapping_lengths"¶
The XS routine limits lcp values that overlap with the preceding entry. This is necessary to avoid overlap of the reported duplicates.
offset2fileindex($offset)
This function uses binary search to get the index of the respective file entry for this offset.
offset2filename($offset)
This function uses binary search to get the index of the respective file entry for this offset. It returns the filename for this entry.
offset2line($offset)
This function uses binary search to get the index of the respective file entry for this offset. Then it again uses binary search to get the max offset of the respective line and returns the line number for this entry.
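Assuming "enter_files" has been called before, the three lookups combine like this sketch:

    my $fileindex = Code::DRY::offset2fileindex($offset);
    my $filename  = Code::DRY::offset2filename($offset);
    my $linenum   = Code::DRY::offset2line($offset);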
"offsetAndFileindex2line($offset, $fileindex, \$nextcompleteLine, \$lastcompleteLine)"¶
This function uses binary search to get the max offset of the respective line belonging to file $fileindex and returns the line number for this entry.
If the third parameter is defined, it is expected to be a reference to a scalar. It will be set to the line number of the next line, unless $offset points to the start of a line, in which case it will be set to the current line number.
If the fourth parameter is defined, it is expected to be a reference to a scalar. It will be set to the line number of the previous line, unless $offset points to the end of a line or there is no previous line, in which case it will be set to the current line number.
"clearData"¶
This function clears all resources that were used for a scan.
get_len_at($index)
This XS function returns the lcp value at index $index or ~0, if the index is out of range.
get_offset_at($index)
This XS function returns the offset value from the suffix array at index $index or ~0, if the index is out of range.
get_isa_at($offset)
This XS function returns the index number from the suffix array where the $offset is found or ~0, if the offset is out of range.
set_lcp_to_zero_for_shadowed_substrings($index)
This XS function sets all prefix lengths to zero for those entries where the suffix is contained in another longer (or more leftward) suffix.
"get_concatenated_text($offset, $length)"¶
This function returns the given text portion of the internal concatenated string at offset $offset with a length of $length.
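A sketch of the iteration mentioned in the SYNOPSIS, combining these accessors (it assumes one suffix array entry per input character and that the lcp array has already been clipped and reduced as described above):

    for my $i (0 .. Code::DRY::get_size() - 1) {
        my $len = Code::DRY::get_len_at($i);
        next if !$len || $len == ~0;    # skip empty or out-of-range entries
        my $offset = Code::DRY::get_offset_at($i);
        print Code::DRY::get_concatenated_text($offset, $len), "\n";
    }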
get_line_offsets_of_fileindex($fileindex)
This function returns the array reference of the line end offsets for the file at index $fileindex.
"get_next_ranked_index" not yet implemented¶
This XS function returns the next index number of the sorted lcp values or ~0, if there are no more entries left.
"reset_rank_iterator" not yet implemented¶
This XS function resets the iterator of the sorted lcp values.
"get_size"¶
This XS function returns the size of the string (in 8-bit characters) used by the "build_suffixarray_and_lcp" function.
"get_lcp"¶
This XS function returns a reference to a copy of the lcp array from the "build_suffixarray_and_lcp" function.
"get_sa"¶
This XS function returns a reference to a copy of the suffix array from the "build_suffixarray_and_lcp" function.
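Both accessors return plain Perl array references that can be inspected directly, for example:

    my $sa_ref  = Code::DRY::get_sa();    # copy of the suffix array
    my $lcp_ref = Code::DRY::get_lcp();   # copy of the lcp array
    print "entries: ", scalar @{$sa_ref}, "\n";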
"__get_text"¶
Internal function
"__free_all"¶
Internal function
DIAGNOSTICS
Output messages
Duplicates are reported (by the default reporter) in the following format:
    1 duplicate(s) found with a length of 8 (>= 2 lines) and 78 bytes reduced to complete lines:
      1. File: t/00_lowlevel.t in lines 57..64 (offsets 1467..1544)
      2. File: t/00_lowlevel.t in lines 74..81 (offsets 1865..1942)
    =================
    ...<duplicated content>
    =================
Error messages
This module can die with the following error messages:
- "cannot open file $file: $!\n";
The opening of a file for read access failed.
- "Error building suffix array:$!\n"
The XS code could not allocate enough memory for the combined file content.
BUGS AND LIMITATIONS
Probably some; it is new code :-).
Currently the underlying XS code operates on 8-bit characters only. For Perl source code this seems to work for most texts.
The full extent of masking out submatches has not yet been implemented.
To report bugs, go to <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Code-DRY> or send mail to <bug-Code-DRY#rt.cpan.org>
EXPORTED SYMBOLS
None by default.
ACKNOWLEDGEMENTS
Thanks to Yuta Mori for providing the C code for the construction of the suffix array (sais-lite) and to Johannes Fischer for extending it with the efficient generation of lcp values. I am grateful that both authors provided their work as open source.
Some code and ideas cribbed from:
Ovid's blog <http://blogs.perl.org/users/ovid/2012/12/finding-duplicate-code-in-perl.html>
SEE ALSO
- Suffix array construction algorithm: G. Nong, S. Zhang, and W. H. Chan. 'Linear suffix array construction by almost pure induced-sorting', In Proc. DCC, pages 193--202. IEEE Press, 2009
- LCP construction algorithm: Johannes Fischer, 'Inducing the LCP-Array' <http://arxiv.org/abs/1101.3448>
- C code: Yuta Mori, sais-lite 2.4.1 at <http://sites.google.com/site/yuta256/sais>
- C code: Johannes Fischer, sais-lite-lcp-master at <https://github.com/elventear/sais-lite-lcp>
- Perl code: Ovid, blog at <http://blogs.perl.org/users/ovid/2012/12/finding-duplicate-code-in-perl.html>, code at <https://gist.github.com/Ovid/4231878#file-find_duplicate_code-pl>
- Theory: Dan Gusfield, 'Algorithms on Strings, Trees, and Sequences', Cambridge University Press, 1999, ISBN 978-0521670357
AUTHOR
Heiko Eißfeldt, <hexcoder@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2014,2019 by hexcoder
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.
For files salcpis.[ch] from the sais-lite-lcp-master package: