new fossil use: irc log publishing

(1.1) By John Rouillard (rouilj) on 2020-11-27 14:39:05 edited from 1.0

Hi all:

I find IRC logs useful. Having to grep them not so much. So I merged Fossil and IRCLogBot.

IRCLogBot is a simple Python script that connects to an IRC channel and writes logs of user messages in markdown format. It generates a new file every day, named (for example) 2020-11-06.md.
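
As a rough illustration of the naming scheme, here is a minimal Python sketch. It is not IRCLogBot's actual code: the function names `log_filename` and `format_entry` are hypothetical, and the anchor layout (an HH:MM.SS id plus a timezone label) is an assumption modelled on the sample link quoted further down in this post.

```python
from datetime import datetime


def log_filename(when: datetime) -> str:
    """Return the per-day log file name, e.g. 2020-11-06.md."""
    return when.strftime("%Y-%m-%d") + ".md"


def format_entry(when: datetime, nick: str, message: str,
                 tz_label: str = "EST") -> str:
    """Format one message as a markdown line with a clickable time anchor.

    The id format and the tz_label default are assumptions, not
    IRCLogBot's documented output.
    """
    stamp = when.strftime("%H:%M.%S")
    anchor = f'<a href="#{stamp}" id="{stamp}">{stamp} ({tz_label})</a>'
    return f"{anchor} **{nick}**: {message}"


if __name__ == "__main__":
    now = datetime(2020, 11, 6, 20, 21, 10)
    print(log_filename(now))
    print(format_entry(now, "rouilj", "hello"))
```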

I put the log directory under fossil control and run a script every minute

   fossil add .; fossil ci -m "automatic update";

to update the web site.
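
For anyone who would rather drive this from Python than from a shell one-liner, the same step might look like the sketch below. It assumes `fossil` is on the PATH and that `log_dir` is an open checkout; the function name `commit_logs` and the injectable `run` parameter are hypothetical. It asks `fossil changes` first so that no commit is attempted when nothing is new.

```python
import subprocess


def commit_logs(log_dir, run=subprocess.run):
    """Add new files, then commit only if fossil reports pending changes.

    `run` is injectable so the logic can be exercised without fossil.
    Returns True if a commit was made, False if nothing had changed.
    """
    opts = dict(cwd=log_dir, check=True, capture_output=True, text=True)
    run(["fossil", "add", "."], **opts)
    changes = run(["fossil", "changes"], **opts)
    if not changes.stdout.strip():
        return False  # nothing new this minute
    run(["fossil", "commit", "-m", "automatic update"], **opts)
    return True
```

Calling `commit_logs("/path/to/checkout")` once a minute from cron or a timer would mirror the script above; the path is a placeholder.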

I changed the Index Page setting to /doc/trunk/log/irc_channel/. I also had to add an index to the irc_channel directory. If you don't have one you get a report that index.html is missing in the trunk version.

To generate an index, I use index.md rather than index.html. index.md is generated with:

  ls -r 20* | sed 's/\(.*\)\.md$/[index for \1](\1.md)/' > index.md

which should be good for the next 99 or so years. Note that the <> form of generating URLs doesn't work because I am using a relative URL that doesn't begin with http/https/....
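
The same index could also be built in Python. The following is a sketch only: `build_index` and `write_index` are hypothetical names, and unlike the `ls` glob it keeps only names that both start with "20" and end in ".md".

```python
from pathlib import Path


def build_index(names):
    """Return markdown link lines for the daily logs, newest first."""
    logs = sorted(
        (n for n in names if n.startswith("20") and n.endswith(".md")),
        reverse=True,
    )
    # n[:-3] strips the ".md" suffix to get the date used as the label.
    return [f"[index for {n[:-3]}]({n})" for n in logs]


def write_index(log_dir):
    """Write index.md next to the daily log files in log_dir."""
    log_dir = Path(log_dir)
    lines = build_index(p.name for p in log_dir.iterdir())
    (log_dir / "index.md").write_text("\n".join(lines) + "\n")
```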

One of the big issues with IRC logs is trying to find information. Users could clone the repo and grep the files, but I enabled search for documentation. Now searching for keywords shows the keywords in the irc message which also includes dates (indicated by filename) and usernames (included in each logged message).

One tricky part is that the IRC logs include a datestamped id for each message. It should be possible to simply click on the datestamp and skip to that entry. If you use the /file entry point, the filename is passed as URL query arguments. For example, rather than /doc/trunk/log/roundup/2020-11-06.md you get https:.../file?name=log/roundup/2020-11-06.md&ci=tip, and links in the document that look like:

   <a href="#20:21.10" id="20:21.10">20:21.10 (EST)</a>

resolve to:

https:.../file#20:21.10

rather than:

https:.../file?name=log/roundup/2020-11-06.md&ci=tip#20:21.10

I think this is a bug. It looks like the base tag doesn't include the query parameters, which are required to make relative URLs in the document work. Since the /doc tree has URLs that are proper paths, I chose to use that.

This was also noticed in September by Warren in https://fossil-scm.org/forum/forumpost/04dc0589dc?t=h. I think simply removing the base tag when displaying files under the /file target will solve this issue as I can't imagine the current behavior being useful.
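
The resolution behaviour can be reproduced with Python's standard `urllib.parse.urljoin`, which follows the usual relative-reference rules. This is only a sketch of my guess at the mechanism: `example.org` is a placeholder host (the real one is elided above), and the assumption is that the base URL in effect lacks the query string.

```python
from urllib.parse import urljoin

FRAGMENT = "#20:21.10"

# Placeholder host; the two bases differ only in the query string.
BASE_WITHOUT_QUERY = "https://example.org/file"
PAGE_URL = "https://example.org/file?name=log/roundup/2020-11-06.md&ci=tip"


def resolve(base, href=FRAGMENT):
    """Resolve a fragment-only link against the given base URL."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Base without the query: the file name is lost from the link.
    print(resolve(BASE_WITHOUT_QUERY))
    # Base equal to the full page URL: the query survives.
    print(resolve(PAGE_URL))
```

A fragment-only link inherits the path and query of whatever base is in effect, so a base without the query cannot produce a working link.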

(2) By MBL (RoboManni) on 2020-11-08 11:12:21 in reply to 1.0

I can see some similarities with one of my use cases: forum post. But while you commit every minute and rely on delta compression, I commit only once a day, after starting out on an hourly basis. What is your repository's growth over time and in total, and what is the compression ratio? Can you share your repository's /stat page?

My search mechanism is still to be developed and implemented in a web-enabled style, and I can see that you have already done that somehow. I am considering the /ext feature (fossil server --extroot <directory>) to implement online search inside the file artifacts using the bisect method and CGI extensions, as my data is basically number driven. In your case, do you organize search through an index file, which is committed as well and linked so that the whole relevant log file can be downloaded? (A before-commit approach in your case versus online search in mine.)

Without having seen examples, I found it difficult to follow your description of the issue, perhaps also because I am more active on Windows than on Linux and not yet familiar enough with the sed command.

Maybe our common denominator is document search similar to the command-line fossil grep, but as a web-enabled solution? What web search support does fossil provide that I may not know about yet? How could searching documents be improved (not only, but especially) via the web interface? Are we the only people looking for such features? Fossil could also become a tool used as a search engine. With grep and bisect there are already good methods available, but only on the command line, am I right?

(4) By John Rouillard (rouilj) on 2023-09-29 02:25:53 in reply to 2

Hello RoboManni, I must have missed your update. Here are the stats you are interested in.

Repository Size:	4,001,792 bytes
Number Of Artifacts:	3,716 (213 fulltext and 3,503 deltas)
Uncompressed Artifact Size:	6,365 bytes average, 53,635 bytes max, 23,652,664 total
Compression Ratio:	5:1
Number Of Check-ins:	1,722
Number Of Files:	201
Number Of Wiki Pages:	1
Project Age:	2 years, 10 months, 22 days

so a 5:1 compression ratio.

Also, while I check every minute to see whether any logs have been updated, the commit succeeds only if there is an update.

You can look at the log using:

  https://rouilj.dynamic-
     dns.net/fossil/roundup_irc_logs

Just paste the lines together. I am just using the standard fossil fts5 search, with a document glob list of *, a document branch of trunk, and Search documents enabled.

I was using an fts tokenizer of none, but I have changed it to the porter stemmer since the logs are in English. Also, the none tokenizer is about 780 KB while the porter stemmer is about 757 KB.
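
To show what stemming buys you, here is a small sketch using SQLite's FTS5 directly from Python. It is not how fossil configures its index: the table name `msgs`, the helper `matching_rows`, and the use of `unicode61` as a stand-in for a non-stemming tokenizer are all assumptions for illustration, and it needs a Python build whose SQLite includes FTS5.

```python
import sqlite3


def matching_rows(tokenizer, rows, query):
    """Index rows in an in-memory FTS5 table and return those matching query."""
    con = sqlite3.connect(":memory:")
    con.execute(
        f"CREATE VIRTUAL TABLE msgs USING fts5(body, tokenize='{tokenizer}')"
    )
    con.executemany("INSERT INTO msgs(body) VALUES (?)", [(r,) for r in rows])
    found = [
        r[0]
        for r in con.execute("SELECT body FROM msgs WHERE msgs MATCH ?", (query,))
    ]
    con.close()
    return found


if __name__ == "__main__":
    rows = ["the bot is logging the channel", "nothing relevant here"]
    # Without stemming, "logged" does not match "logging".
    print(matching_rows("unicode61", rows, "logged"))
    # With the porter stemmer, both words reduce to the same stem.
    print(matching_rows("porter", rows, "logged"))
```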

(5) By MBL (RoboManni) on 2023-10-03 17:36:58 in reply to 4

Hello John, thank you very much for this reply ... I did not expect another reply so long after the last post.

(3) By MBL (RoboManni) on 2020-11-08 13:15:51 in reply to 1.0

Here is my attempt to show this use case as a pikchr diagram:

L1: "chkin 1:"
B1: box "1-200"           wid 200% ht 50% fill 0xc6e2ff thin
B2: box "1-300 (100 new)" wid 300% ht 50% fill 0xc6e2ff thin with .nw at B1.sw
B3: box "1-400 (100 new)" wid 400% ht 50% fill 0xC0C010 thin color green  thin with .nw at B2.sw
B4: box "1-500 at end of day 1" wid 500% ht 50% fill 0xE0E010 thick color green  with .nw at B3.sw

B5: box "501-600" wid 100% ht 50% fill 0xc6e2ff thin     with .nw at B4.se
B6: box "501-700" wid 200% ht 50% fill 0xc6e2ff thin with .nw at B5.sw
B7: box "501- end of day 2" wid 300% ht 50% fill 0xc6e2ff thin with .nw at B6.sw

A1: arrow from (B5.e, B1.n) left 380%
line color red down until even with last box.s
"440 (date-stamped id to find)" ljust bold at A1.e + (0.05,0)

L2:  "chkin 2:" at (L1.s.x,B2.c.y)
L3:  "chkin 3:" at (L1.s.x,B3.c.y)
L4:  "chkin 4:" at (L1.s.x,B4.c.y)

L5:  "chkin 5:" at (L1.s.x,B5.c.y)
L6:  "chkin 6:" at (L1.s.x,B6.c.y)
L7:  "chkin 7:" at (L1.s.x,B7.c.y)

I assume the id within a day increases monotonically, just as the day number does. Searching by date-stamped sequence id could therefore use the bisect method as well, since the search direction for each next bisect step would be deterministic.

Searching by grep would have to be done in each new delta-compressed artifact and could only proceed linearly. A time limit, e.g. "search within the last week", should bound each grep search.
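
The bisect idea can be sketched in Python with the numbers from the diagram. This is only an illustration under the assumption stated above (ids grow monotonically within a day); the name `first_checkin_containing` is hypothetical, and the list holds the highest id present in each check-in of one day's file.

```python
from bisect import bisect_left


def first_checkin_containing(last_ids, target):
    """Return the 1-based check-in number that first contains target.

    last_ids[i] is the highest message id present in check-in i+1 of one
    day's log file, sorted ascending. Returns None if target is beyond
    the last check-in.
    """
    pos = bisect_left(last_ids, target)
    return pos + 1 if pos < len(last_ids) else None


if __name__ == "__main__":
    day1 = [200, 300, 400, 500]  # check-ins 1 to 4 from the diagram
    print(first_checkin_containing(day1, 440))  # prints 4
```

Each step halves the remaining range, so the number of check-ins inspected grows only logarithmically, unlike the linear grep case.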

(6) By seeg on 2023-11-03 11:55:23 in reply to 3

If anyone is interested, I coded a very small Rust program that listens on a given IRC channel and logs into a given directory. One can run it like this:

cargo run -- --nickname irc-logger --server irc.libera.chat --room '#some-test-room' --output-dir $HOME/some-test-room-logs

A fossil repo has to be configured in $HOME/some-test-room-logs.

I then serve that repo using fossil serve.