KVIrc and CTCP

For developers: Client-To-Client Protocol handling in KVIrc

Introduction

Personally, I think that the CTCP specification should be symbolically printed and burned. It is far too complex (you can go mad with the quoting specifications) and NO IRC CLIENT supports it completely. Here is my personal point of view on the CTCP protocol.

What is CTCP?

CTCP stands for Client-to-Client Protocol. It is designed for exchanging almost arbitrary data between IRC clients; the data is embedded into text messages of the underlying IRC protocol.

Basic concepts

A CTCP message is sent as the <text> part of the PRIVMSG and NOTICE IRC commands.
To differentiate a CTCP message from normal IRC message text we use a delimiter character (ASCII character 1); we will use the symbol <0x01> for this delimiter. You may receive a CTCP message from the server in one of the following two ways:
:<source_mask> PRIVMSG <target> :<0x01><ctcp message><0x01>
:<source_mask> NOTICE <target> :<0x01><ctcp message><0x01>
PRIVMSG is used for CTCP REQUESTS, NOTICE for CTCP REPLIES. The NOTICE form should never generate an automatic reply.
The two delimiters begin and terminate the CTCP message. The original protocol allowed more than one CTCP message inside a single IRC message, but nobody sends more than one message at once, no client recognizes it (since it complicates the message parsing), and it can even be dangerous (see below). It makes no real sense unless we wanted to use the CTCP protocol to embed escape sequences into IRC messages, which is not the case.
Furthermore, sending several CTCP messages in a single IRC message can easily be used to flood a client. Assuming 450 characters available for the IRC message text part, you could include 50 CTCP messages containing "<0x01>VERSION<0x01>" (9 characters each).
Since VERSION replies are usually long (at most 3 or 4 replies fit in a single IRC message), a client that has no CTCP flood protection (or has it disabled) will surely be disconnected while sending the replies, after receiving only a single IRC message (no flood for the sender). From my personal point of view, only one CTCP message per IRC message should be allowed, and the trailing <0x01> delimiter can theoretically be optional.
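A minimal sketch of the framing described above, assuming the "one CTCP message per IRC message" convention. The name build_ctcp_line and its parameters are hypothetical and chosen only for illustration; they are not part of KVIrc.

```c
#include <stdio.h>

/* Hypothetical helper: formats a single CTCP message as an IRC command.
 * is_reply == 0 selects PRIVMSG (request), anything else NOTICE (reply).
 * Returns the number of characters written, or -1 if the result does not
 * fit in out_size bytes or exceeds the 510 usable characters of an IRC
 * message (the terminating CR-LF is not added here). */
int build_ctcp_line(char *out, size_t out_size, int is_reply,
                    const char *target, const char *ctcp_message)
{
    int n = snprintf(out, out_size, "%s %s :\001%s\001",
                     is_reply ? "NOTICE" : "PRIVMSG", target, ctcp_message);
    if (n < 0 || (size_t)n >= out_size || n > 510)
        return -1;
    return n;
}
```

Calling build_ctcp_line(line, sizeof line, 0, "Pragma", "VERSION") would produce a PRIVMSG request; the same call with is_reply set to 1 produces the NOTICE form that must never trigger an automatic reply.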

How to extract the CTCP message

IRC messages do not allow the following characters to be sent:
<NUL> (ASCII character 0), <CR> (Carriage return), <LF> (Line feed).
So in the end we have four characters that cannot appear literally in a CTCP message: <NUL>, <CR>, <LF>, <0x01>.
To extract a <ctcp message> from an IRC PRIVMSG or NOTICE command you have to perform the following actions:
Find the <trailing> part of the IRC message (the one just after the ':' delimiter, or the last message token).
Check whether the first character of <trailing> is <0x01>; if it is, we have a <ctcp message> beginning just after the <0x01>. The trailing (optional) <0x01> can be removed in this phase or later, assuming that it is not a valid character in the <ctcp message>.
In this document I will assume that you have stripped the trailing <0x01>, and thus from now on we will deal only with the <ctcp message> part.
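The two steps above can be sketched as follows. This is only an illustration: the name extract_ctcp_message is hypothetical, and the function assumes that <trailing> has already been isolated and is null-terminated.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if trailing starts with <0x01>, copies the CTCP
 * message (without the leading delimiter and without the optional
 * trailing one) into out and returns 1. Returns 0 when trailing is
 * normal IRC text or when the message does not fit in out_size bytes. */
int extract_ctcp_message(const char *trailing, char *out, size_t out_size)
{
    size_t len;
    if (trailing[0] != '\001')
        return 0;                       /* not a CTCP message */
    trailing++;                         /* skip the leading delimiter */
    len = strlen(trailing);
    if (len > 0 && trailing[len - 1] == '\001')
        len--;                          /* strip the optional trailing delimiter */
    if (len >= out_size)
        return 0;
    memcpy(out, trailing, len);
    out[len] = '\0';
    return 1;
}
```

Only the last character is treated as the closing delimiter, so any <0x01> in the middle of the text is left in place, in line with the single-message convention.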

Parsing a CTCP message: The quoting dilemma

Since there are characters that cannot appear in a <ctcp message>, theoretically we should use a quoting mechanism. Well, in fact, no actual CTCP message uses the quoting: there is no need to include a <NUL>, a <CR> or an <LF> inside the currently defined messages (the only exception could be CTCP SED, but I have never seen it in action... is there any client that implements it?). We could also leave the quoting to the semantic of the single message type: a message that needs to include arbitrary characters could have its own encoding method (Base64 for example). With the "one CTCP per IRC message" convention we could even allow <0x01> inside messages: only the leading (and possibly the trailing) <0x01> would be a delimiter, and the other ones would be valid characters. Finally, is there any CTCP type that needs <0x01> inside a message? <0x01> is not printable (just like <CR>, <LF> and <NUL>), so only encoded messages (and again we can stick to the single message semantic) or the ones including special parameters would need it. Some machines might allow <0x01> in filenames... well, a file with <0x01> in its name has something broken inside, or the creator is a sort of hacker (so he also knows how to rename a file...) :).
Anyway, let's be pedantic, and define this quoting method. Let's use the most intuitive method, adopted all around the world:
The backslash character ('\') as escape.
An escape sequence is formed by the backslash character and a number of following ASCII characters. We define the following two types of escape sequences:
'\XXX' (where XXX is an octal number formed by three digits) that indicates the ASCII character with code that corresponds to the number.
'\C' (where C is a CTCP valid ASCII non-digit character) that corresponds literally to the character C, discarding any other semantic that might be associated with it (this will become clear later). The '\C' sequence is useful to include the backslash character itself (escape sequence '\\').
I've chosen the octal representation just to follow the old specification a bit: the authors seemed to like it. This point could be discussed on some mailing list or something similar.
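To make the two escape forms concrete, here is a sketch of the encoding direction. The name ctcp_quote is hypothetical, and the choice to always emit three octal digits is an assumption made here so that a following literal digit can never be absorbed into the escape.

```c
#include <stddef.h>

/* Hypothetical encoder for the quoting defined above.
 * The backslash becomes '\\'; <NUL>, <CR>, <LF> and <0x01> become
 * '\XXX' with exactly three octal digits. Everything else is copied.
 * Returns the number of characters written (terminator excluded),
 * or -1 if out is too small. */
int ctcp_quote(const unsigned char *in, size_t in_len,
               char *out, size_t out_size)
{
    size_t o = 0, i;
    for (i = 0; i < in_len; i++) {
        unsigned char c = in[i];
        if (c == '\\') {
            if (o + 2 >= out_size) return -1;
            out[o++] = '\\';
            out[o++] = '\\';
        } else if (c == 0 || c == '\r' || c == '\n' || c == 1) {
            if (o + 4 >= out_size) return -1;
            out[o++] = '\\';
            out[o++] = (char)('0' + ((c >> 6) & 7));
            out[o++] = (char)('0' + ((c >> 3) & 7));
            out[o++] = (char)('0' + (c & 7));
        } else {
            if (o + 1 >= out_size) return -1;
            out[o++] = (char)c;
        }
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return (int)o;
}
```

The input length is passed explicitly because the data to be quoted may itself contain <NUL>.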

Let's mess a little more

A CTCP message is made of space separated parameters.
The natural way of separating parameters is to use the space character. We define a token as a sequence of valid CTCP characters not including a literal space. A <ctcp parameter> is usually a token, but not always: filenames can contain spaces (and it happens very often!), so one of the parameters of CTCP DCC is not a space separated token. How do we handle it? Again, a standard is missing. Some clients simply change the filename, placing underscores instead of spaces; this is a reasonable solution if used with care. Other clients attempt to isolate the filename token by surrounding it with some kind of quotes, usually the " or ' characters. This is also a good solution. Another one that naturally comes to mind is to use the previously defined quoting to define a non-breaking space character, because a space after a backslash could lose its original semantic. Better yet, use the backslash followed by the octal representation of the space character ('\040'). Anyway, to maintain compatibility with other popular IRC clients (such as mIRC), let's include the " quotes in our standard: literal (unescaped) " quotes define a single token string. To include a literal " character, escape it. Additionally, the last parameter of a <ctcp message> may be made of multiple tokens.
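A sketch of the sending side of this convention, assuming we only add " quotes when the parameter actually contains a space. The name ctcp_quote_parameter is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes param as a single CTCP token.
 * A parameter containing spaces is wrapped in " quotes; literal " and
 * backslash characters are escaped with a backslash in any case.
 * Returns the number of characters written, or -1 if out is too small. */
int ctcp_quote_parameter(const char *param, char *out, size_t out_size)
{
    size_t o = 0;
    int quote = strchr(param, ' ') != NULL;
    if (out_size < 3) return -1;
    if (quote) out[o++] = '"';
    for (; *param; param++) {
        /* keep room for a two character escape, the closing quote
         * and the terminator */
        if (o + 4 > out_size) return -1;
        if (*param == '"' || *param == '\\') out[o++] = '\\';
        out[o++] = *param;
    }
    if (quote) out[o++] = '"';
    out[o] = '\0';
    return (int)o;
}
```

With this sketch the filename my file.txt becomes the single token "my file.txt", while a parameter without spaces is passed through with only its " and backslash characters escaped.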

A CTCP parameter extracting example

A trivial example of a C CTCP parameter extracting routine follows.
An IRC message is made of up to 510 usable characters. When a CTCP is sent there is a PRIVMSG or NOTICE token that uses at least 6 characters, at least two spaces, a target token (that cannot be empty, so it is at least one character) and finally one <0x01> delimiter character. This leaves 510 - 6 - 2 - 1 - 1 = 500 characters as the maximum size for a complete <ctcp message> and thus for a <ctcp token>. In fact, the <ctcp message> is always smaller than 500 characters: there are usually two <0x01> characters, there is a message source part at the beginning of the IRC message that is 10-15 characters long, and there is a : character before the trailing parameter. Anyway, to really be on the safe side, we use a 512 character buffer for each <ctcp token>. Finally, I'll assume that you have already ensured that the <ctcp message> that we are extracting from is shorter than 511 characters in all, and have provided a buffer big enough to avoid this code segfaulting. I'm assuming that msg_ptr points somewhere in the <ctcp message> and that the message is null-terminated.
(The code uses C++ style comments; you might want to remove them.)

const char * decode_escape(const char * msg_ptr,char * buffer)
{
    // This one decodes an escape sequence
    // and returns the pointer "just after it"
    // and should be called when *msg_ptr points
    // just after a backslash
    char c;
    if((*msg_ptr >= '0') && (*msg_ptr < '8'))
    {
        // a digit follows the backslash
        c = *msg_ptr - '0';
        msg_ptr++;
        if((*msg_ptr >= '0') && (*msg_ptr < '8'))
        {
            c = ((c << 3) + (*msg_ptr - '0'));
            msg_ptr++;
            if((*msg_ptr >= '0') && (*msg_ptr < '8'))
            {
                c = ((c << 3) + (*msg_ptr - '0'));
                msg_ptr++;
            } // else broken message, but let's be flexible
        } // else it is broken, but let's be flexible
        // append the character and return
        *buffer = c;
        return msg_ptr;
    } else {
        // simple escape: just append the following
        // character (thus discarding its semantic)
        *buffer = *msg_ptr;
        return ++msg_ptr;
    }
}
const char * extract_ctcp_parameter(const char * msg_ptr,char * buffer,int spaceBreaks)
{
    // this one extracts the "next" ctcp parameter in msg_ptr
    // it skips the leading and trailing spaces.
    // spaceBreaks should be set to 0 if (and only if) the
    // extracted parameter is the last in the CTCP message.
    // The extracted parameter is left null-terminated in buffer.
    int inString = 0;
    while(*msg_ptr == ' ')msg_ptr++;
    while(*msg_ptr)
    {
        switch(*msg_ptr)
        {
            case '\\':
                // backslash : escape sequence
                msg_ptr++;
                // decode_escape stores exactly one character,
                // so the buffer pointer must advance by one too
                if(*msg_ptr)msg_ptr = decode_escape(msg_ptr,buffer++);
                else {
                    // senseless backslash
                    *buffer = '\0';
                    return msg_ptr;
                }
            break;
            case ' ':
                // space : separate tokens?
                if(inString || (!spaceBreaks))*buffer++ = *msg_ptr++;
                else {
                    // not in string and space breaks: end of token
                    // skip trailing white space (this could be avoided)
                    // and return
                    while(*msg_ptr == ' ')msg_ptr++;
                    *buffer = '\0';
                    return msg_ptr;
                }
            break;
            case '"':
                // a string begin or end
                inString = !inString;
                msg_ptr++;
            break;
            default:
                // any other char
                *buffer++ = *msg_ptr++;
            break;
        }
    }
    *buffer = '\0';
    return msg_ptr;
}

CTCP parameter semantics

The first <ctcp parameter> of a <ctcp message> is the <ctcp tag>: it defines the semantic of the rest of the message.
Although it is a convention to specify the <ctcp tag> as uppercase letters, and the original specification says that the whole <ctcp message> is case sensitive, I'd prefer to follow the IRC message semantic (just to have less "special cases") and treat the whole message as case insensitive.
The remaining tokens depend on the <ctcp tag>. A description of known <ctcp tags> and thus <ctcp messages> follows.
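A sketch of case insensitive tag recognition, covering the tags and aliases described in this document. The enum, the table and the name lookup_ctcp_tag are hypothetical and only illustrate the idea.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum {
    CTCP_UNKNOWN, CTCP_PING, CTCP_VERSION, CTCP_USERINFO, CTCP_CLIENTINFO,
    CTCP_FINGER, CTCP_SOURCE, CTCP_TIME, CTCP_ACTION, CTCP_AVATAR,
    CTCP_MULTIMEDIA, CTCP_DCC
} ctcp_tag_id;

/* Case insensitive comparison of two null-terminated strings */
static int tag_equals(const char *a, const char *b)
{
    while (*a && *b) {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Hypothetical lookup: maps a <ctcp tag> (any case) to an identifier */
ctcp_tag_id lookup_ctcp_tag(const char *tag)
{
    static const struct { const char *name; ctcp_tag_id id; } table[] = {
        { "PING", CTCP_PING },             { "VERSION", CTCP_VERSION },
        { "USERINFO", CTCP_USERINFO },     { "CLIENTINFO", CTCP_CLIENTINFO },
        { "FINGER", CTCP_FINGER },         { "SOURCE", CTCP_SOURCE },
        { "TIME", CTCP_TIME },             { "ACTION", CTCP_ACTION },
        { "AVATAR", CTCP_AVATAR },         { "ICON", CTCP_AVATAR },
        { "FACE", CTCP_AVATAR },           { "MULTIMEDIA", CTCP_MULTIMEDIA },
        { "MM", CTCP_MULTIMEDIA },         { "SOUND", CTCP_MULTIMEDIA },
        { "DCC", CTCP_DCC }
    };
    size_t i;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (tag_equals(tag, table[i].name))
            return table[i].id;
    return CTCP_UNKNOWN;
}
```

The comparison is written by hand with toupper so that the sketch depends only on standard C.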

PING

Syntax: <0x01>PING <data><0x01>
The PING request is used to check the round trip time from one client to another. The receiving client should reply with exactly the same message but sent through a NOTICE instead of a PRIVMSG. The <data> usually contains an unsigned integer but not necessarily; it is not even mandatory for <data> to be a single token. The receiver should ignore the semantic of <data>.
The reply is intended to be processed by IRC clients.
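A sketch of how the requesting client might process the reply, under the assumption that it put its own clock value (in seconds) into <data> when sending the request. The receiver never interprets <data>, so this convention is private to the sender; the name ctcp_ping_round_trip is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: reply_data is the <data> part of a PING reply and
 * now_seconds is the current clock value in the same unit used when the
 * request was sent. Returns the elapsed seconds, or -1 if <data> does
 * not start with a usable integer (for example a reply to a request
 * that some other tool generated). */
long ctcp_ping_round_trip(const char *reply_data, long now_seconds)
{
    char *end;
    long sent;
    errno = 0;
    sent = strtol(reply_data, &end, 10);
    if (end == reply_data || errno == ERANGE)
        return -1;                 /* no digits, or out of range */
    if (sent > now_seconds)
        return -1;                 /* timestamp from the future */
    return now_seconds - sent;
}
```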

VERSION

Syntax: <0x01>VERSION<0x01>
The VERSION request asks for information about another user's IRC client program. The reply should be sent through a NOTICE with the following syntax:
<0x01>VERSION <client_version_data><0x01>
The preferred form for <client_version_data> is <client_name>:<client_version>:<client_environment>, but historically clients (and users) send a generic reply describing the client name, the version and possibly the name of the script in use. This CTCP reply is intended to be human readable, so any form is accepted.

USERINFO

Syntax: <0x01>USERINFO<0x01>
The USERINFO request asks for information about another user. The reply should be sent through a NOTICE with the following syntax:
<0x01>USERINFO <user_info_data><0x01>
The <user_info_data> should be a human readable, user defined string.

CLIENTINFO

Syntax: <0x01>CLIENTINFO<0x01>
The CLIENTINFO request asks for information about another user's IRC client program. While VERSION requests the client program name and version, CLIENTINFO requests information about CTCP capabilities.
The reply should be sent through a NOTICE with the following syntax:
<0x01>CLIENTINFO <client_info_data><0x01>
The <client_info_data> should contain a list of supported CTCP request tags. The CLIENTINFO reply is intended to be human readable.

FINGER

Syntax: <0x01>FINGER<0x01>
The FINGER request asks for information about another IRC user. The reply should be sent through a NOTICE with the following syntax:
<0x01>FINGER <user_info_data><0x01>
The <user_info_data> should be a human readable string containing the system username and possibly the system idle time.

SOURCE

Syntax: <0x01>SOURCE<0x01>
The SOURCE request asks for the client homepage or ftp site information. The reply should be sent through a NOTICE with the following syntax:
<0x01>SOURCE <homepage_url_data><0x01>
This CTCP reply is intended to be human readable, so any form is accepted.

TIME

Syntax: <0x01>TIME<0x01>
The TIME request asks for the user's local time. The reply should be sent through a NOTICE with the following syntax:
<0x01>TIME <time and date string><0x01>
This CTCP reply is intended to be human readable, so any form is accepted.
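Since any human readable form is accepted, the reply can be built with the standard strftime function. The format string below is only one possible choice, and the name build_time_reply is hypothetical.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: formats the <ctcp message> of a TIME reply,
 * delimiters included, from an already broken-down local time.
 * Returns the number of characters written, or -1 on failure. */
int build_time_reply(const struct tm *local, char *out, size_t out_size)
{
    char stamp[64];
    int n;
    /* e.g. "Sat Jan 01 12:30:00 2000" in the default "C" locale */
    if (strftime(stamp, sizeof stamp, "%a %b %d %H:%M:%S %Y", local) == 0)
        return -1;
    n = snprintf(out, out_size, "\001TIME %s\001", stamp);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}
```

A caller would typically obtain the struct tm from localtime and send the result as the text part of a NOTICE to the requesting user.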

ACTION

Syntax: <0x01>ACTION <action text><0x01>
The ACTION tag is used to describe an action.
It should be sent through a NOTICE message and never generate a reply.

AVATAR (equivalent to ICON or FACE)

Syntax: <0x01>AVATAR<0x01>
The AVATAR tag is used to query a user's avatar.

MULTIMEDIA (equivalent to MM or SOUND)

Syntax: <0x01>MULTIMEDIA <filename><0x01>
The MULTIMEDIA tag is used to play a multimedia file on the receiver's side.
The receiving client should locate the file associated with <filename> and play it. If the file cannot be located by the receiving client, and the MULTIMEDIA tag was sent through a PRIVMSG format CTCP, the receiving client CAN request a DCC GET <filename> from the source user. If the MULTIMEDIA tag was sent through a NOTICE message, the receiving client should not generate any reply: the message should be shown to the receiving client's user and then be discarded. The <filename> should never contain a leading path. If any part of the <filename> appears to be a path component, it should be discarded. The client may decide to drop the entire message too. Older clients (including older releases of KVIrc) used to request the missing files by a particular non-standard private message syntax. This convention should be dropped.
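A sketch of the path stripping rule above. Treating '/', the backslash and ':' as path separators is an assumption made here to cover the common platforms; the name multimedia_base_name is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: returns a pointer to the last path component of
 * filename, discarding anything that looks like a leading path.
 * Returns NULL when nothing usable is left, in which case the caller
 * may drop the entire message. */
const char *multimedia_base_name(const char *filename)
{
    const char *base = filename;
    const char *p;
    for (p = filename; *p; p++)
        if (*p == '/' || *p == '\\' || *p == ':')
            base = p + 1;          /* restart after each separator */
    if (*base == '\0' || strcmp(base, ".") == 0 || strcmp(base, "..") == 0)
        return NULL;
    return base;
}
```

The returned name should then be looked up only inside the directories that the client has configured for multimedia files.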

DCC

Syntax: <0x01>DCC <type> <type dependent parameters><0x01>
The DCC tag is used to initiate a Direct Client Connection. The known DCC types are:

CHAT
SEND
SSEND
TSEND
GET
TGET
ACCEPT
RESUME
