Fossil

Check-in [6eb9a30c]

Overview
Comment: more optimizations (all lead bytes between 0x80 & 0xBF are invalid, so use simple check for those, and also can shrink the invalid_utf8 lead byte table even more)
SHA1: 6eb9a30c085d755122b7f875ade5636fdb7300b2
User & Date: sdr 2016-06-16 17:01:00
Original Comment: more optimizations (all bytes between 0x80 & 0xBF are invalid, so use simple check for those, and also can shrink the invalid_utf8 lead byte table even more)
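
The reasoning in the comment can be sketched as a three-way split on the first byte of a sequence: ASCII, a byte in 0x80-0xBF (which can only be a continuation byte and is therefore never a valid lead byte), and everything else, which still needs a table lookup. The enum and function names below are hypothetical and do not appear in src/lookslike.c; this is an illustration of the idea, not the check-in's code.

```c
/* Hypothetical names, for illustration only. */
enum byte_class {
  BYTE_ASCII,               /* 0x00-0x7F: a complete one-byte sequence */
  BYTE_STRAY_CONTINUATION,  /* 0x80-0xBF: never valid as a lead byte */
  BYTE_LEAD_CANDIDATE       /* 0xC0-0xFF: needs the lead byte table */
};

/* Classify the byte found where a new sequence is expected to start. */
static enum byte_class classify_first_byte(unsigned char b){
  if( b<0x80 ) return BYTE_ASCII;
  if( b<0xC0 ) return BYTE_STRAY_CONTINUATION;
  return BYTE_LEAD_CANDIDATE;
}
```

Because the middle class is rejected by a plain comparison, the lookup table no longer needs entries for those 64 byte values.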
Context
2016-06-17
00:04
merged from trunk Closed-Leaf check-in: 8a877a7b user: sdr tags: invalid_utf8_table
2016-06-16
17:01
more optimizations (all lead bytes between 0x80 & 0xBF are invalid, so use simple check for those, and also can shrink the invalid_utf8 lead byte table even more) check-in: 6eb9a30c user: sdr tags: invalid_utf8_table
12:14
More optimizations, taken over from trunk. check-in: ec7f6b2e user: jan.nijtmans tags: invalid_utf8_table
Changes

Changes to src/lookslike.c.

Old version (lines 153-174 and 184-200):
#define US4B  4, 0x80, 0xBF /* for lead bytes 0xF1-0xF3 */
#define US4C  4, 0x80, 0x8F /* for lead byte 0xF4 */
#define US0A  0xFF, 0xFF, 0x00 /* for any other lead byte */

/* a table used for quick lookup of the definition that goes with a
 * particular lead byte */
static const unsigned char lb_tab[] = {
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US2A, US0A, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US3A, US3B, US3B, US3B, US3B, US3B, US3B, US3B,
  US3B, US3B, US3B, US3B, US3B, US3B, US3B, US3B,
  US4A, US4B, US4B, US4B, US4C, US0A, US0A, US0A,
................................................................................

  /* while we haven't checked all the bytes in the buffer */
  while( n>0 ){
    /* ascii is trivial */
    if( *z<0x80 ){
      ++z;
      --n;
    }else{
      /* get the definition for this lead byte */
      const unsigned char* def = &lb_tab[(3 * *z++)-0x180];
      unsigned char len;

      /* get the expected sequence length */
      len = *def;
      /* if there aren't enough bytes left, return invalid */
      if( n<len ) {
        return LOOK_INVALID;







New version (lines 153-166 and 176-194):
#define US4B  4, 0x80, 0xBF /* for lead bytes 0xF1-0xF3 */
#define US4C  4, 0x80, 0x8F /* for lead byte 0xF4 */
#define US0A  0xFF, 0xFF, 0x00 /* for any other lead byte */

/* a table used for quick lookup of the definition that goes with a
 * particular lead byte */
static const unsigned char lb_tab[] = {
  US2A, US0A, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US2B, US2B, US2B, US2B, US2B, US2B, US2B, US2B,
  US3A, US3B, US3B, US3B, US3B, US3B, US3B, US3B,
  US3B, US3B, US3B, US3B, US3B, US3B, US3B, US3B,
  US4A, US4B, US4B, US4B, US4C, US0A, US0A, US0A,
................................................................................

  /* while we haven't checked all the bytes in the buffer */
  while( n>0 ){
    /* ascii is trivial */
    if( *z<0x80 ){
      ++z;
      --n;
    }else if( *z<0xC0 ){
      return LOOK_INVALID;
    }else{
      /* get the definition for this lead byte */
      const unsigned char* def = &lb_tab[(3 * *z++)-0x240];
      unsigned char len;

      /* get the expected sequence length */
      len = *def;
      /* if there aren't enough bytes left, return invalid */
      if( n<len ) {
        return LOOK_INVALID;
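
The change of the index constant from 0x180 to 0x240 follows from the table layout: every entry macro expands to three bytes, so a table whose first entry belongs to lead byte 0x80 needs an offset of 3*0x80 = 0x180, while one that starts at 0xC0 needs 3*0xC0 = 0x240. The sketch below spells out that arithmetic. The helper names and LB_ENTRY_SIZE are hypothetical and are not part of src/lookslike.c.

```c
#include <assert.h>
#include <stddef.h>

/* Each lb_tab entry (US2A, US4B, US0A, ...) expands to three bytes. */
enum { LB_ENTRY_SIZE = 3 };

/* Old layout: the first table entry belongs to lead byte 0x80,
 * so the byte offset is 3*lead - 3*0x80 = 3*lead - 0x180. */
static size_t lb_offset_old(unsigned char lead){
  assert( lead>=0x80 );
  return (size_t)(LB_ENTRY_SIZE*lead) - 0x180;
}

/* New layout: bytes 0x80-0xBF are rejected before the lookup, so the
 * first table entry belongs to 0xC0 and the offset is 3*lead - 0x240. */
static size_t lb_offset_new(unsigned char lead){
  assert( lead>=0xC0 );
  return (size_t)(LB_ENTRY_SIZE*lead) - 0x240;
}

/* Bytes removed from the table: 64 dropped entries of 3 bytes each. */
static size_t lb_bytes_saved(void){
  return (size_t)LB_ENTRY_SIZE*(0xC0-0x80);
}
```

Under these assumptions the two offsets for any lead byte at or above 0xC0 differ by exactly 0x240-0x180 = 0xC0 (192) bytes, which matches the eight removed rows of eight US0A entries.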