Fossil

Check-in [758e3d31]

Overview
Comment: If the table is encoded as start-value/size, a variable and a comparison can be saved. Should be even faster ....
SHA1: 758e3d318893fe5478bbcade2a5826574a07ec62
User & Date: jan.nijtmans 2016-06-18 16:50:53
Context
2016-06-26 17:04  Improve comments Closed-Leaf check-in: 8bdd0abc user: jan.nijtmans tags: invalid_utf8_improvements
2016-06-18 16:50  If the table is encoded as start-value/size, a variable and a comparison can be saved. Should be even faster .... check-in: 758e3d31 user: jan.nijtmans tags: invalid_utf8_improvements
2016-06-18 14:44  Juggle variables and code arround, making it as efficient and readable as possible. Also add more comments. check-in: 7f067f29 user: jan.nijtmans tags: invalid_utf8_improvements
Changes

Changes to src/lookslike.c.

Old version:

** It's number of higher 1-bits indicate the number of continuation bytes
** that are expected to be followed. E.g. when 'c2' has a value in the range
** 0xc0..0xdf it means that 'c' is expected to contain the last continuation
** byte of a UTF-8 character. A value 0xe0..0xef means that after 'c' one
** more continuation byte is expected.
*/

/* definitions for various UTF-8 sequence lengths */

#define US2A  0x7F, 0x80 /* for lead byte 0xC0 */
#define US2B  0x7F, 0xBF /* for lead bytes 0xC2-0xDF */
#define US3A  0x9F, 0xBF /* for lead byte 0xE0 */
#define US3B  0x7F, 0xBF /* for lead bytes 0xE1-0xEF */
#define US4A  0x8F, 0xBF /* for lead byte 0xF0 */
#define US4B  0x7F, 0xBF /* for lead bytes 0xF1-0xF3 */
#define US4C  0x7F, 0x8F /* for lead byte 0xF4 */
#define US0A  0xFF, 0x00 /* for any other lead byte */

/* a table used for quick lookup of the definition that goes with a
 * particular lead byte */
static const unsigned char lb_tab[] = {
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
................................................................................
  unsigned int n = blob_size(pContent);
  unsigned char c; /* lead byte to be handled. */

  if( n==0 ) return 0;  /* Empty file -> OK */
  c = *z;
  while( --n>0 ){
    if( c>=0x80 ){
      unsigned char fb = *++z; /* follow-up byte after lead byte */
      const unsigned char *def; /* pointer to range table*/

      c <<= 1; /* multiply by 2 and get rid of highest bit */
      def = &lb_tab[c]; /* search fb's valid range in table */
      if( (fb<=def[0]) || (fb>def[1]) ){
        return LOOK_INVALID; /* Invalid UTF-8 */
      }
      c = (c>=0xC0) ? (c|3) : ' '; /* determine next lead byte */
    } else {
      c = *++z;
    }
  }

New version:

** It's number of higher 1-bits indicate the number of continuation bytes
** that are expected to be followed. E.g. when 'c2' has a value in the range
** 0xc0..0xdf it means that 'c' is expected to contain the last continuation
** byte of a UTF-8 character. A value 0xe0..0xef means that after 'c' one
** more continuation byte is expected.
*/

/* definitions for various UTF-8 sequence lengths, encoded as start value
 * and size of each valid range belonging to some lead byte*/
#define US2A  0x80, 0x01 /* for lead byte 0xC0 */
#define US2B  0x80, 0x40 /* for lead bytes 0xC2-0xDF */
#define US3A  0xA0, 0x20 /* for lead byte 0xE0 */
#define US3B  0x80, 0x40 /* for lead bytes 0xE1-0xEF */
#define US4A  0x90, 0x30 /* for lead byte 0xF0 */
#define US4B  0x80, 0x40 /* for lead bytes 0xF1-0xF3 */
#define US4C  0x80, 0x10 /* for lead byte 0xF4 */
#define US0A  0xFF, 0x00 /* for any other lead byte */

/* a table used for quick lookup of the definition that goes with a
 * particular lead byte */
static const unsigned char lb_tab[] = {
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
................................................................................
  unsigned int n = blob_size(pContent);
  unsigned char c; /* lead byte to be handled. */

  if( n==0 ) return 0;  /* Empty file -> OK */
  c = *z;
  while( --n>0 ){
    if( c>=0x80 ){

      const unsigned char *def; /* pointer to range table*/

      c <<= 1; /* multiply by 2 and get rid of highest bit */
      def = &lb_tab[c]; /* search fb's valid range in table */
      if( (unsigned int)(*++z-def[0])>=def[1] ){
        return LOOK_INVALID; /* Invalid UTF-8 */
      }
      c = (c>=0xC0) ? (c|3) : ' '; /* determine next lead byte */
    } else {
      c = *++z;
    }
  }
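
The check-in comment says that storing each range as start value and size saves a variable and a comparison. The sketch below is not part of the check-in; it is a small standalone check, under the assumption that the eight macro pairs shown above are the only table entries, that the old two-comparison test and the new single-comparison test reject exactly the same follow-up bytes. The names `range_pair`, `pairs`, `invalid_old`, `invalid_new` and `count_mismatches` are hypothetical and chosen for illustration only.

```c
#include <stddef.h>

/* One entry per lead-byte class, copied from the macros in the diff. */
struct range_pair {
  unsigned char old_lo, old_hi; /* old encoding: exclusive low, inclusive high */
  unsigned char start, size;    /* new encoding: first valid byte, range length */
};

static const struct range_pair pairs[] = {
  {0x7F, 0x80, 0x80, 0x01}, /* US2A */
  {0x7F, 0xBF, 0x80, 0x40}, /* US2B */
  {0x9F, 0xBF, 0xA0, 0x20}, /* US3A */
  {0x7F, 0xBF, 0x80, 0x40}, /* US3B */
  {0x8F, 0xBF, 0x90, 0x30}, /* US4A */
  {0x7F, 0xBF, 0x80, 0x40}, /* US4B */
  {0x7F, 0x8F, 0x80, 0x10}, /* US4C */
  {0xFF, 0x00, 0xFF, 0x00}, /* US0A: rejects every byte in both encodings */
};

/* Old test: two comparisons against the bounds. Returns 1 if fb is invalid. */
static int invalid_old(unsigned char fb, unsigned char lo, unsigned char hi){
  return (fb<=lo) || (fb>hi);
}

/* New test: one comparison. fb-start is computed as int; if fb<start the
** negative result converts to a very large unsigned value, so it also
** fails the ">= size" check. Returns 1 if fb is invalid. */
static int invalid_new(unsigned char fb, unsigned char start, unsigned char size){
  return (unsigned int)(fb-start)>=size;
}

/* Count the (entry, byte) combinations on which the two tests disagree. */
static int count_mismatches(void){
  int bad = 0;
  size_t i;
  unsigned int fb;
  for(i=0; i<sizeof(pairs)/sizeof(pairs[0]); i++){
    for(fb=0; fb<256; fb++){
      int a = invalid_old((unsigned char)fb, pairs[i].old_lo, pairs[i].old_hi);
      int b = invalid_new((unsigned char)fb, pairs[i].start, pairs[i].size);
      if( a!=b ) bad++;
    }
  }
  return bad;
}
```

Calling `count_mismatches()` from a test driver should find no disagreement between the two encodings. Because the single comparison reads the byte directly, the new loop body no longer needs the `fb` variable.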