Fossil: Artifact [b451033b]

Artifact b451033b2624b50b2a863f8eeffcdd386e2f6f5b:

Wiki page [Import CVS Repositories] by aku 2008-03-14 21:07:40.
D 2008-03-14T21:07:40
L Import\sCVS\sRepositories
P f2c05a6ad279c0cb0c7249b9ff0cea12d9153bf6
U aku
W 5644
Spiritual ancestor: [http://cvs2svn.tigris.org/|cvs2svn].

Similarities:
  *  Using the same high-level architecture (pass-based).
  *  Using some of its specific algorithms (graph traversal).

Differences:
  *  Not using any of its code (different languages, for one thing: [http://www.python.org/|Python] there, [http://www.tcl.tk/|Tcl] here).
  *  The persistent state is completely different: a [http://www.sqlite.org/|sqlite] database holds everything we wish to keep between passes.

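The pass-based design with persistent state can be sketched as follows. This is only an illustration in Python, not the importer's Tcl code: the table name pass_state, the helper run_passes, and the two toy passes are hypothetical. Each pass writes its results into the sqlite database, and a marker row records that the pass completed, so a later run can skip finished passes.

```python
import sqlite3

def run_passes(db, passes):
    """Run (name, func) passes in order, skipping those already marked as done."""
    db.execute("CREATE TABLE IF NOT EXISTS pass_state (name TEXT PRIMARY KEY)")
    db.commit()
    done = {row[0] for row in db.execute("SELECT name FROM pass_state")}
    executed = []
    for name, func in passes:
        if name in done:
            continue
        # One transaction per pass: the rows written by the pass and its
        # completion marker are committed together, or rolled back together.
        with db:
            func(db)
            db.execute("INSERT INTO pass_state (name) VALUES (?)", (name,))
        executed.append(name)
    return executed

# Toy stand-ins for real passes (hypothetical tables and data).
def collect_archives(db):
    db.execute("CREATE TABLE IF NOT EXISTS archive (path TEXT)")
    db.execute("INSERT INTO archive VALUES ('module/file.c,v')")

def collect_revisions(db):
    db.execute("CREATE TABLE IF NOT EXISTS revision (path TEXT, rev TEXT)")
    db.execute("INSERT INTO revision SELECT path, '1.1' FROM archive")

PASSES = [("CollectAr", collect_archives), ("CollectRev", collect_revisions)]

db = sqlite3.connect(":memory:")
first = run_passes(db, PASSES)    # runs both passes
second = run_passes(db, PASSES)   # nothing left to do
```

With a database file instead of ":memory:", the second call models a restart after an interruption.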
<p>
Status:
<table  border=1>
<tr>
<th>Pass</th>
<th>Description</th>
<th>Notes</th>
</tr>
<tr>
<td>CollectAr</td>
<td>Collect archives</td>
<td>Ok</td>
</tr>
<tr>
<td>CollectRev</td>
<td>Collect revisions, tags, branches (file level)</td>
<td>Ok</td>
</tr>
<tr>
<td>CollateSymbols</td>
<td>Collate symbols (project level) from the file level data</td>
<td>Ok</td>
</tr>
<tr>
<td>FilterSymbols</td>
<td>Filter symbols, exclude symbols and lines of development</td>
<td>Ok</td>
</tr>
<tr>
<td>InitCsets</td>
<td>Create initial changesets</td>
<td>Ok.</td>
</tr>
<tr>
<td>CsetDeps</td>
<td>Compute changeset dependencies from revision dependencies</td>
<td>Ok</td>
</tr>
<tr>
<td>BreakRevCsetCycles</td>
<td>Break cycles among revision changesets</td>
<td>Ok</td>
</tr>
<tr>
<td>RevTopologicalSort</td>
<td>Topologically sort revision changesets</td>
<td>Ok</td>
</tr>
<tr>
<td>BreakSymCsetCycles</td>
<td>Break cycles among symbol changesets</td>
<td>Ok</td>
</tr>
<tr>
<td>BreakAllCsetCycles</td>
<td>Break cycles over all changesets</td>
<td>Ok. (Accepting that it may still change the order of the revision changesets relative to the result of pass 7.)</td>
</tr>
<tr>
<td>AllTopologicalSort</td>
<td>Topologically sort all changesets</td>
<td>Ok</td>
</tr>
<tr>
<td>ImportFiles</td>
<td>Import files</td>
<td>Ok.</td>
</tr>
<tr>
<td>ImportCSets</td>
<td>Import changesets</td>
<td>Ok.</td>
</tr>
<tr>
<td>ImportFinal</td>
<td>Import finalization (fossil rebuild)</td>
<td>Ok.</td>
</tr>
</table>
</p>
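To illustrate what the topological sort passes have to deliver, here is a minimal sketch using Kahn's algorithm. The choice of algorithm is an assumption made for illustration; this page does not say which one the importer uses. The input is a hypothetical mapping from each changeset id to the set of changesets it depends on, and every dependency must itself be a key of the mapping.

```python
from collections import deque

def topological_order(deps):
    """Order changesets so that every changeset comes after its dependencies.

    deps maps changeset id -> set of ids it depends on.
    Raises ValueError if a cycle prevents a complete ordering.
    """
    pending = {cs: len(pre) for cs, pre in deps.items()}
    dependents = {cs: [] for cs in deps}
    for cs, pre in deps.items():
        for p in pre:
            dependents[p].append(cs)
    # Start with the changesets that depend on nothing.
    ready = deque(sorted(cs for cs, n in pending.items() if n == 0))
    order = []
    while ready:
        cs = ready.popleft()
        order.append(cs)
        for d in sorted(dependents[cs]):
            pending[d] -= 1
            if pending[d] == 0:
                ready.append(d)
    if len(order) != len(deps):
        # Whatever is left sits on a cycle or behind one.
        stuck = sorted(set(deps) - set(order))
        raise ValueError("cannot sort, cycle involving or blocking: %r" % stuck)
    return order

order = topological_order({1: set(), 2: {1}, 3: {1}, 4: {2, 3}})
```

The failure case is the reason the cycle breaking passes run first: a sort of this kind only terminates with a complete order when the dependency graph is acyclic.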


Notes regarding the actual import:
<ul>
<li>cvs2svn is either slow or hungry for disk space. The reason: it imports changeset by changeset, so it must either regenerate the needed file revisions on demand, over and over, or cache each revision from the time it is created until its last user is gone.
</li>
<li>We can do better, if we get help from fossil. We would need commands to perform the following actions:
<ul>
<li> Import a file as blob, return its internal id.
</li>
<li> Deltify a known file with respect to a second known file.
</li>
<li> Generate a manifest file for a list of files (paths, ids), parent manifest references, user, timestamp, log message. Could be signed or not.
</li>
</ul>
<br>
With these actions (possibly in combination) we can import the archive files first, needing only space for the revisions of a single file (bounded by the largest file in terms of size and history), with their delta-links mirroring the RCS structure. After that we can independently generate, import, and deltify the manifests for changesets. To finalize we simply 'rebuild' the repository. This should be fast without needing much temporary disk space either.
<br>
Currently I am thinking of a command
<pre>
fossil import-files SPECFILE
</pre>
which imports and deltifies files as blobs per the SPECFILE, returning on stdout the association of paths and internal ids. Example SPECFILE:
<pre>
F path1
D path2 path1
...
</pre>
The 'F' card tells it to import 'path1' as is. The 'D' card then tells it to import 'path2' as is and then deltify 'path1' in terms of 'path2'. And so on.
<br>
To import an RCS archive we now just have to generate the full revisions in some tempdir, then create a SPECFILE describing the delta-links and hand it to fossil. The resulting ids then go into the importer state for when we generate the manifests.
<br>
To import manifests we need a second command, which generates a manifest from the changeset data and imports it as well.
<pre>
fossil import-manifest user timestamp MESSAGEFILE FILEFILE parent...
</pre>
The MESSAGEFILE contains the log message; the FILEFILE lists the revisions in the manifest by id, each with its path in the workspace. The parents are the ids of the parent manifests. The result of the command is the id of the new manifest.
<br>
Note: We already have commands for the put and deltify operations, albeit undocumented, in the set of test commands.
</li>
</ul>
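As an illustration of the proposed SPECFILE, the sketch below generates the cards for a single trunk chain. It assumes the full revisions were already written to a tempdir, that the paths contain no whitespace, and that the delta-links should mirror an RCS trunk, where the newest revision is kept in full and each older one is a delta against its successor. The function name trunk_spec and the file names are hypothetical.

```python
def trunk_spec(paths_oldest_first):
    """Return SPECFILE text for one trunk chain of full-text revision files."""
    if not paths_oldest_first:
        return ""
    # The oldest revision is imported as is ...
    lines = ["F %s" % paths_oldest_first[0]]
    for older, newer in zip(paths_oldest_first, paths_oldest_first[1:]):
        # ... and each 'D' card imports the newer file as is, then
        # deltifies the older file in terms of it.
        lines.append("D %s %s" % (newer, older))
    return "\n".join(lines) + "\n"

spec = trunk_spec(["tmp/r1.1", "tmp/r1.2", "tmp/r1.3"])
```

For the three revisions above the generated cards are 'F tmp/r1.1', 'D tmp/r1.2 tmp/r1.1' and 'D tmp/r1.3 tmp/r1.2', which leaves only r1.3 stored in full.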

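What the manifest-generating command would have to assemble can be sketched like this. The sketch rests on assumptions: it uses the card style visible in this page's own artifact header (D, P, U, Z) plus C (comment) and F (file) cards, treats the Z card as an MD5 checksum over all preceding text, and escapes only backslash, space, and newline. The function names and the ids in the example are hypothetical, and nothing here is fossil's actual code.

```python
import hashlib

def encode(text):
    """Escape a card argument (assumed escape set: backslash, space, newline)."""
    return text.replace("\\", "\\\\").replace(" ", "\\s").replace("\n", "\\n")

def manifest_text(user, timestamp, message, files, parents):
    """Assemble manifest text.

    files   -- iterable of (path, id) pairs
    parents -- list of parent manifest ids (may be empty)
    """
    cards = ["C " + encode(message), "D " + timestamp]
    cards += ["F %s %s" % (encode(path), fid) for path, fid in sorted(files)]
    if parents:
        cards.append("P " + " ".join(parents))
    cards.append("U " + encode(user))
    body = "".join(card + "\n" for card in cards)
    # Z card: checksum over everything before it (assumption, see above).
    return body + "Z " + hashlib.md5(body.encode("utf-8")).hexdigest() + "\n"

example = manifest_text(
    "aku", "2008-03-14T21:07:40", "Initial import",
    [("src/b.c", "id-b"), ("src/a.c", "id-a")],   # placeholder ids
    ["parent-id"],
)
```

Since the file ids come from the earlier file import, the manifests can be generated without touching the file contents again.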
Miscellaneous:
<ul>
<li>Add an option to divert the log output into a file, except for the progress reporting. The result is a nice interactive log plus progress, and a (relatively) small log file without all the unwanted progress numbers.
</li>
<li>Go over the SQL statements and check that they have good query plans.
</li>
<li>Go over the SQL statements and add comments where they are non-trivial.
</li>
<li>Currently some passes take quite a bit of time after their actual operation is already complete. It seems that sqlite is busy committing a large number of changes (most passes are wrapped in a single transaction). Consider ways of speeding this up.
<ul>
<li>Do our own transaction management, so that we can commit every X changes?
</li>
<li>Should we disable synchronous operation? The state database is not that critical, i.e. it can always be regenerated. On the other hand, being sure that we can restart from the interrupted pass is nice, and that is not possible if asynchronous operation has corrupted the database. <b>Tabled for now; this is not yet critical.</b>
</li>
</ul>
</li>
</ul>
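The two speed-up ideas can be sketched together: committing every X changes, and switching off synchronous writes via sqlite's PRAGMA synchronous. This is a Python illustration, not the importer's Tcl code; the table item, the helper insert_in_batches, and the batch size are hypothetical.

```python
import sqlite3

def insert_in_batches(db, rows, batch_size):
    """Insert rows, committing after every batch_size rows.

    Returns the number of commits performed.
    """
    commits = 0
    pending = 0
    for row in rows:
        db.execute("INSERT INTO item (id, name) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            db.commit()
            commits += 1
            pending = 0
    if pending:
        # Commit the final, partially filled batch.
        db.commit()
        commits += 1
    return commits

db = sqlite3.connect(":memory:")
# Trades durability for speed; only acceptable if the state can be regenerated.
db.execute("PRAGMA synchronous = OFF")
db.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
commits = insert_in_batches(db, ((i, "cs%d" % i) for i in range(10)), 4)
```

Note the cost of batching: a pass is no longer all-or-nothing, so a restart has to cope with a pass that was interrupted after some of its batches were committed.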

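Checking the query plans can be done with sqlite's EXPLAIN QUERY PLAN. The sketch below is a Python illustration with a hypothetical revision table and index; it shows how a plan changes once a suitable index exists. The exact wording of the plan text differs between sqlite versions, so only the last column of each row is collected here.

```python
import sqlite3

def query_plan(db, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql, params)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revision (rid INTEGER PRIMARY KEY, fid INTEGER, rev TEXT)")

sql = "SELECT rid FROM revision WHERE fid = ?"
before = query_plan(db, sql, (1,))   # without an index: a full table scan

db.execute("CREATE INDEX revision_fid ON revision (fid)")
after = query_plan(db, sql, (1,))    # with the index: a search using revision_fid
```

Running every non-trivial statement through such a helper once would show which ones still scan whole tables.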
Z 779f7588ec8ced927b372f431d8153d3