
Technology
The following diagram from Patent 6,236,993
helps clarify our unique, patented technique.
It shows the utility automatically discovering
that two files, (1) and (2), are identical
except for the dates they contain. Note the
following:
- It is unnecessary to specify the starting
location or format of any specific mismatching
token (e.g. dates) within the files. It is
also unnecessary to specify the difference
between the tokens (e.g. the time period
that the file was aged for the second file).
- The process works even if the format of the
token changes between runs. Speed and accuracy
can be improved by limiting the list of token
formats that may be found in the file.
- The process is completely automatic: the user
only specifies the token formats that may occur
in the file. Our utility contains a parser that
identifies hundreds of thousands of different
formats, and more are added upon demand.
- The process digitally verifies that mismatching
tokens are the expected ones. It was fascinating
to watch programmers checking for date problems
(during Y2K) when the first step was to throw
away the date information and look for secondary
effects (like interest calculated for 100
years rather than 1 month). We said back
then, if you are looking for problems in
dates, "don't throw out the primary
source of information." Now we say,
"if you are not interested in the differences
in the time units, recognize those elements,
then verify that the rest of the files are
identical."
- Switches control which format sets the utility
checks for in the files. For example, one switch
instructs the utility to "check for dates
with 3-digit years." Other switches inhibit
default behavior (such as "do not check
for 2-digit years").
The shaded areas are the portions of the files
that match identically.
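
To illustrate the point above about limiting the
token formats, here is a minimal Python sketch.
It is not the utility's parser. The token
"01/02/20", the names FIELD_ORDER and
interpretations, and the fixed "/" separator are
all assumptions made for this example. It only
shows how the number of candidate dates shrinks
when fewer formats are enabled.

```python
from datetime import date

# Position of (year, month, day) within the token for each format name.
# The names and this table are illustrative assumptions.
FIELD_ORDER = {
    "ymd": (0, 1, 2),
    "mdy": (2, 0, 1),
    "dmy": (2, 1, 0),
}

def interpretations(token, enabled_formats, centuries=(1900, 2000)):
    """Return every valid (format, date) reading of a token like '01/02/20'."""
    fields = [int(part) for part in token.split("/")]
    readings = []
    for name in enabled_formats:
        y_idx, m_idx, d_idx = FIELD_ORDER[name]
        for century in centuries:
            try:
                candidate = date(century + fields[y_idx],
                                 fields[m_idx], fields[d_idx])
            except ValueError:
                continue  # not a real calendar date in this format
            readings.append((name, candidate))
    return readings

all_formats = interpretations("01/02/20", ["ymd", "mdy", "dmy"])
mdy_only = interpretations("01/02/20", ["mdy"])
print(len(all_formats), len(mdy_only))
```

With all three formats enabled, this token has
six 2-digit-year readings. With only mdy enabled
it has two, so there are fewer candidates to
compare and fewer chances of a coincidental match.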
This utility uses sound engineering processes,
not silver bullet magic. It utilizes the
computer's memory to compute the difference
between the two files. The process is as
follows:
- First, the files are compared until a mismatching
character is detected between the two files.
- The first mismatch is detected at the (3)
symbol. Here, every possible interpretation
of a date that includes the mismatching byte
is computed and added to a list. Assume that
only 2- and 4-digit years are present in the
file and that all dates fall within the 20th
and 21st centuries. The possible dates from
file (1) are:
- Feb 20, 1901 (ymd format)
- Jan 2, 1920 (mdy format)
- Feb 1, 1920 (dmy format)
- Jan 2, 2000 (mdy format)
- Feb 1, 2000 (dmy format)
- Feb 20, 2001 (ymd format)
- Jan 2, 2020 (mdy format)
- Feb 1, 2020 (dmy format).
- Every possible date is also computed for
the second file. Here, the dates can be:
- Jan 2, 1998 (mdy format)
- Feb 1, 1998 (dmy format)
- Jan 2, 2098 (mdy format)
- Feb 1, 2098 (dmy format).
- The difference between every possible pair
of dates (one from each file) is computed.
Since there are 8 dates possible for the first
file and 4 dates possible for the second file,
there are a total of 32 differences computed.
Each of these differences is placed on the
list of potential differences for the files.
Differences are added to this list only during
this first step.
- The comparison process continues in the files
until the next difference is detected at
(4). Again, every possible date that includes
the mismatching bytes is computed, and all
possible differences between those dates
are computed. The only possible date for
the "1999365" is Dec 31, 1999, in a year
plus day-of-year format. There is also only
one date possible for the "12/31/97."
So, the difference between the dates in the
two files is 2 years, and the diagram reflects
the possible interpretations of the dates.
Any previously discovered potential difference
that does not occur in the current list is
eliminated from the list of potential differences
for the files.
- The process ends when there are no more differences
in the list of potential differences ("the
files are different") or the end of
the files is reached ("the files match").
In this specific illustration, only the date
interpretations that differ by exactly 2 years
remain valid, so the files match.
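
The steps above can be sketched in a few lines
of Python. This is a simplified illustration, not
the patented implementation. The function names
are hypothetical, the candidate dates are copied
from the lists above rather than parsed from a
file, and a difference is modeled as a
component-wise (years, months, days) tuple, which
is an assumption made to keep the example short.

```python
from datetime import date
from itertools import product

def diff(a, b):
    """Component-wise (years, months, days) gap between two dates.
    A simplification for illustration, not the patent's definition."""
    return (a.year - b.year, a.month - b.month, a.day - b.day)

def possible_differences(dates1, dates2):
    """Differences between every candidate from file 1 and file 2."""
    return {diff(a, b) for a, b in product(dates1, dates2)}

def compare(mismatches):
    """mismatches: (candidates_file1, candidates_file2) pairs in file order.
    Returns (files_match, surviving_differences)."""
    surviving = None
    for dates1, dates2 in mismatches:
        current = possible_differences(dates1, dates2)
        # The first mismatch seeds the list; later ones only eliminate.
        surviving = current if surviving is None else surviving & current
        if not surviving:
            return False, set()  # no consistent difference is left
    return True, surviving or set()

# Candidate dates at mismatch (3), copied from the lists above.
file1_at_3 = [date(1901, 2, 20), date(1920, 1, 2), date(1920, 2, 1),
              date(2000, 1, 2), date(2000, 2, 1), date(2001, 2, 20),
              date(2020, 1, 2), date(2020, 2, 1)]
file2_at_3 = [date(1998, 1, 2), date(1998, 2, 1),
              date(2098, 1, 2), date(2098, 2, 1)]
# Candidate dates at mismatch (4): one interpretation per file.
file1_at_4 = [date(1999, 12, 31)]
file2_at_4 = [date(1997, 12, 31)]

pairs_at_3 = len(file1_at_3) * len(file2_at_3)
matched, surviving = compare([(file1_at_3, file2_at_3),
                              (file1_at_4, file2_at_4)])
print(pairs_at_3, matched, surviving)
```

Run on the values from the illustration, the
sketch forms 32 date pairs at (3), and after (4)
the only surviving difference is (2, 0, 0), that
is, exactly 2 years. Because the differences are
kept in a set, pairs that yield the same
difference collapse into a single entry.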
This is a description of one illustration
from the patent. It does not describe the
whole patent, which generally covers all applications
of this technique to compare files (whenever
there could be some ambiguity in the interpretation
of a date, time, currency, temperature, etc.
contained within the files). To view the
full patent, use the
Patent Office Patent Number Search database and search
for Patent No. 6,236,993.
Last modified June 6, 2001
