Re: gnuplot slow on BIG files

Dimitrios Apostolou
Daniel J Sebald wrote:
> Please *explain* why the patch is faster.  Those listening will
> understand. Also, when running diff be sure to use unified (-u) so that
> it indicates what file the hunks come from.

I don't know why it is faster. I just wrote a simple parser. I don't
understand what more the old parser does. I just can see that it is much
more complicated. My guess is that after so many years of development
and after many additions that today we see but can't figure out, the
code became a bit "bloated".

Do what you think is better: optimize the current parser, rewrite a new
one, or use mine as a base for improvement. One thing I know for sure:
it shouldn't stay as it is.

The patch I published is against the file:
gnuplot-4.0.0/src/datafile.c

Dimitris


-------------------------------------------------------
This SF.Net email is sponsored by: NEC IT Guy Games.  How far can you shotput
a projector? How fast can you ride your desk chair down the office luge track?
If you want to score the big prize, get to know the little guy.  
Play to win an NEC 61" plasma display: http://www.necitguy.com/?r=20
_______________________________________________
gnuplot-beta mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gnuplot-beta

Re: gnuplot slow on BIG files

Dimitrios Apostolou
Here is the unified patch. In the previous patch I also diffed the files
in the wrong order.

Dimitris

--- gnuplot-4.0.0/src/datafile.c 2004-04-13 20:23:52.000000000 +0300
+++ datafile.c 2005-06-03 01:43:44.000000000 +0300
@@ -226,6 +226,8 @@
 #endif
 static TBOOLEAN mixed_data_fp = FALSE; /* inline data */
 static char *df_filename; /* name of data file */
+static off_t filesize;
+
 
 #ifndef MAXINT /* should there be one already defined ? */
 # ifdef INT_MAX /* in limits.h ? */
@@ -556,6 +558,75 @@
 
 /*}}} */
 
+#define PBS 30  /* Parse Buffer Size */
+
+static float **
+df_read_matrix_jimis(rows, cols)
+    int *rows, *cols;
+{
+ *rows = 0;
+ *cols = 0;
+ float **rmatrix = NULL;
+ float *tmp_arr = NULL;
+ int ch, newline, elements_num, j, error;
+ char par_buf[PBS]; /* Parse Buffer */
+
+ /* malloc the maximum we may use, it's ok in an overcommiting OS like linux */
+ tmp_arr = gp_alloc(filesize * sizeof(float), "df_matrix");
+
+ newline=1; elements_num=0; j=0; error=0;
+ do { /* Let the parsing begin! */
+ ch=fgetc(data_fp);
+ if (ch==' ' || ch=='\t' || ch=='\n' || ch=='\r' || ch==EOF) {
+ if (j!=0)  { /* if I have just read a field */
+ par_buf[j]='\0';
+ tmp_arr[elements_num++] = atof(par_buf);
+ j=0;
+ }
+ /* We count one line less if the file doesn't end with a newline*/
+ if (ch=='\n') {
+ /* one more line parsed */
+ (*rows)++;
+ newline=1;
+ }
+ }
+ else
+ if (newline)
+ /* Bypass lines starting with # */
+ if (ch=='#') {
+ do {
+ ch=fgetc(data_fp);
+ } while ((ch!='\n') && (ch!=EOF));
+ ungetc(ch, data_fp); ch=' ';
+ }
+ else
+ {
+ if (j<PBS) par_buf[j++]=ch;
+ newline=0;
+ }
+ else
+ if (j<PBS) par_buf[j++]=ch;
+ } while (ch!=EOF && !error);
+ if (*rows == 0) {
+ free(tmp_arr);
+ return NULL;
+ }
+ else
+ *cols = elements_num / *rows;
+ if (error) /* currently not so useful because atof() doesn't detect errors */
+ int_error(NO_CARET, "Bad Matrix");
+
+ /* Free unused memory */
+ tmp_arr = gp_realloc(tmp_arr, (*rows) * (*cols) * sizeof(float), "df_matrix");
+
+ /* fix the indexes of the 2D array */
+ rmatrix = gp_alloc((*rows) * sizeof(float *), "df_matrix");
+ for (j=0; j<(*rows); j++) {
+ rmatrix[j] = &(tmp_arr[j*(*cols)]);
+ }
+
+ return rmatrix;
+}
 
 /*{{{  int df_open(max_using) */
 int
@@ -739,6 +810,8 @@
     os_error(name_token, "\"%s\" is not a regular file or pipe",
      df_filename);
  }
+ if (stat(df_filename, &statbuf) > -1) filesize=statbuf.st_size;
+
 #endif /* HAVE_SYS_STAT_H */
  if ((data_fp = loadpath_fopen(df_filename, df_binary ? "rb" : "r")) ==
     (FILE *) NULL) {
@@ -1294,7 +1367,7 @@
  /* fread_matrix() drains the file */
  df_eof = 1;
     } else {
- if (!(dmatrix = df_read_matrix(&nr, &nc))) {
+ if (!(dmatrix = df_read_matrix_jimis(&nr, &nc))) {
     df_eof = 1;
     return 0;
  }
@@ -1408,7 +1481,10 @@
  this_plot->num_iso_read++;
     }
 
-    free_matrix(dmatrix, 0, nr - 1, 0);
+    if (df_binary)
+ free_matrix(dmatrix, 0, nr - 1, 0);
+    else
+ free(*dmatrix);
     if (rt)
  free_vector(rt, 0);
     if (ct)

Re: gnuplot slow on BIG files

Dimitrios Apostolou
In reply to this post by Dimitrios Apostolou
> On Thursday 02 June 2005 03:53 pm, Dimitrios Apostolou wrote:
>
>>So I submit to you a patch (against the v. 4.0 gnuplot) for the file
>>src/datafile.c as a proof of concept
>
>
> Could you please re-do the patch using
> diff -ur  <oldfile> <newfile>
>

I submit the patch for a third time, sorry for the spamming, but since
I'm not subscribed to the list my emails wait for approval.

This time I used the format you suggested:
diff -ur  <oldfile> <newfile>

>>There are many things in my code that you probably won't like.
>
>
> Your comments worry me.  For instance:
>
> < /* malloc the maximum we may use, it's ok in an overcommiting OS like linux */
> < tmp_arr = gp_alloc(filesize * sizeof(float), "df_matrix");
>
> gnuplot core code must run on systems other than linux.
> And even for linux your statement is not true.  Many people doing
> serious number crunching will not run linux in overcommit mode, because
> it is too painful to see a computation which has already run for 3 days
> be killed by the OOM killer just because someone has opened a web browser,
> or in this case because they try to run gnuplot.
As I said my code only serves as a proof of concept. In case you care to
use it this is one of the things that should change.
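
One way to avoid depending on overcommit is to grow the buffer as values arrive instead of reserving `filesize * sizeof(float)` up front. The following is only a minimal sketch of that idea: `struct float_buf` and `float_buf_push` are hypothetical names, and plain `realloc` stands in for gnuplot's own allocation wrappers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable buffer of floats; not part of gnuplot. */
struct float_buf {
    float *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
};

/* Append one value, doubling the capacity when the buffer is full.
 * Returns 0 on success, -1 if memory could not be obtained; in that
 * case the existing contents are left untouched. */
static int float_buf_push(struct float_buf *b, float v)
{
    if (b->len == b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 1024;
        float *p;

        if (new_cap > SIZE_MAX / sizeof *p)
            return -1;                      /* size computation would overflow */
        p = realloc(b->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    b->data[b->len++] = v;
    return 0;
}
```

A caller would start from a zero-initialised `struct float_buf`, push each parsed value, and `free(b.data)` at the end. Doubling keeps the number of reallocations logarithmic in the element count while never requesting much more than twice what is actually used.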

Dimitris


Re: gnuplot slow on BIG files

Robert Hart
On Sat, 4 Jun 2005, Dimitrios Apostolou wrote:

> This time I used the format you suggested:
> diff -ur  <oldfile> <newfile>

The annoying thing about this patch is that it fails to show how your
code compares to what was there before, because you didn't actually
remove the old version.

Rob

--
         \ Robert Hart
       ___\______ [hidden email]
      /   ^    \_]======] http://www.nott.ac.uk/~enxrah
 ____[##########\_____
/ ___________________ \ 15 Benington Drive
\/{oOOOOOOOOOOOOOOOo}\/ Wollaton
  \o%%%%%%%%%%%%%%%o/ Nottingham
   ~~~~~~~~~~~~~~~~~ NG8 2TF


This message has been checked for viruses but the contents of an attachment
may still contain software viruses, which could damage your computer system:
you are advised to perform your own checks. Email communications with the
University of Nottingham may be monitored as permitted by UK legislation.




Re: gnuplot slow on BIG files

Daniel J Sebald
In reply to this post by Dimitrios Apostolou
Dimitrios Apostolou wrote:

> Daniel J Sebald wrote:
>
>> Please *explain* why the patch is faster.  Those listening will
>> understand. Also, when running diff be sure to use unified (-u) so
>> that it indicates what file the hunks come from.
>
>
> I don't know why it is faster. I just wrote a simple parser. I don't
> understand what more the old parser does. I just can see that it is much
> more complicated. My guess is that after so many years of development
> and after many additions that today we see but can't figure out, the
> code became a bit "bloated".

(OK, saw this file in a backlog of email.  Hans must be sending these through
manually.)  It is a possibility.  That matrix code is sort of a tacked on thing.
  But unless we understand what the problem is how can we know that there is
extraneous, inefficient code?  However, if you believe what you've coded meets
the definitions and format in the documentation and is faster, then consider
redoing the patch with the old, unneeded code removed.  Otherwise, it will leave
cruft floating about, just what you are attempting to improve upon.

Thanks,

Dan



Re: gnuplot slow on BIG files

Dimitrios Apostolou
> (OK, saw this file in a backlog of email.  Hans must be sending these
> through manually.)  It is a possibility.  That matrix code is sort of a
> tacked on thing.  But unless we understand what the problem is how can
> we know that there is extraneous, inefficient code?  However, if you
> believe what you've coded meets the definitions and format in the
> documentation and is faster, then consider redoing the patch with the
> old, unneeded code removed.  Otherwise, it will leave cruft floating
> about, just what you are attempting to improve upon.

My patch certainly doesn't meet any definitions or format. I have no
time right now to rewrite the patch correctly and according to the
coding standards. If you wish that I send you another patch with the old
code replaced, tell me so and I will. However it will be of the same
(low) quality, which I think is not ready to replace the current code.

Has anyone actually tried it to see the speed improvement? I have done
no benchmarks, but what I described in an earlier email (how to plot big
SRTM files) now works *much* faster (a speedup of 10x or more).

For now I will be happy to see an entry in the TODO file about
optimizing the highly inefficient "matrix" parser. In the future, if
you haven't found the time to fix it, perhaps I will submit a proper
patch, compliant with the coding standards and good enough for you to use.

Thanks,
Dimitris



Re: gnuplot slow on BIG files

Petr Mikulik
In reply to this post by Dimitrios Apostolou
> Do what you think is better: optimize the current parser, rewrite a new
> one, or use mine as a base for improvement. One thing I know for sure:
> it shouldn't stay as it is.
>
> The patch I published is against the file:
> gnuplot-4.0.0/src/datafile.c

Can you please update it for the current datafile.c from cvs on sourceforge?
(There are minor rejects.)

---
PM





Re: gnuplot slow on BIG files

Daniel J Sebald
In reply to this post by Dimitrios Apostolou
Dimitrios Apostolou wrote:

> For now I will be happy to see an entry in the TODO file about
> optimizing the highly inefficient "matrix" parser. In the future, if
> you haven't found the time to fix it, perhaps I will submit a proper
> patch, compliant with the coding standards and good enough for you to use.

You are familiar with the SourceForge "patch" site, aren't you?  What you are
proposing is a slightly bigger fix and hence requires some consideration.  Often
smaller patches submitted through discussion can go to CVS as a no-brainer.
However, in the case of larger patches, SourceForge is nice for maintaining
gradual development, both for other developers and yourself.  You can come back
to it whenever you have the free time.  (Consider creating a SourceForge account.)

Dan



Re: gnuplot slow on BIG files

Robert Hart
In reply to this post by Dimitrios Apostolou
On Wed, 2005-06-08 at 15:31 +0300, Dimitrios Apostolou wrote:

> Has anyone actually tried it to see the speed improvement? I have done
> no benchmarks, but what I described in an earlier email (how to plot big
> SRTM files) now works *much* faster (a speedup of 10x or more).

I've had a look into this using an n x n ASCII matrix (generated with
Perl's sin() function).

Existing code: 66 seconds

after adding:

#define NO_FORTRAN_NUMS

to top of datafile.c: 5.9 seconds.

Rationale:

I used callgrind (and kcachegrind) to profile gnuplot whilst loading
this datafile. It turned out that >95% of the time was spent in the sscanf at
datafile.c:759

Questions: Where did this "NO_FORTRAN_NUMS" option come from, and why
isn't it enabled? Is Fortran number support useful? Could it be provided
in an alternative way (perhaps a run time option)?


Rob





Re: gnuplot slow on BIG files

Ethan Merritt
On Wednesday 08 June 2005 01:50 pm, Robert Hart wrote:
> I used callgrind (and kcachegrind) to profile gnuplot whilst loading
> this datafile. It turned out that >95% of the time was spent in the sscanf at
> datafile.c:759

Thank you for taking the time to pin this down.


> Questions: Where did this "NO_FORTRAN_NUMS" option come from, and why
> isn't it enabled? Is Fortran number support useful? Could it be provided
> in an alternative way (perhaps a run time option)?

Excellent questions.

Here's a brief excerpt from a Fortran manual:
  A double precision constant has the same form as a scaled real constant
  except that the E is replaced by D. Examples:
    6.1D2 is equivalent to 610.0
    +2.3D3 is equivalent to 2300.0
    -3.5D-1 is equivalent to -0.35
    +4D4 is equivalent to 40000

I have no idea how common this might be in real life data files
fed to gnuplot.
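
To make the notation concrete, here is a small sketch of reading such a constant in C by rewriting the exponent letter and handing the result to `strtod`. The helper name `parse_fortran_double` and the 64-byte token limit are assumptions made for illustration; this is not the code in datafile.c.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not gnuplot code: parse one numeric token that may
 * use a Fortran-style exponent letter (d, D, q or Q) in place of e/E.
 * Returns 1 and stores the value in *out on success; returns 0 and leaves
 * *out untouched if the token is not a complete number. */
static int parse_fortran_double(const char *token, double *out)
{
    char buf[64];
    char *end;
    char *p;
    double v;
    size_t n = strlen(token);

    if (n == 0 || n >= sizeof buf)
        return 0;
    memcpy(buf, token, n + 1);          /* work on a private copy */

    /* Rewrite the first exponent letter so that strtod() understands it. */
    p = strpbrk(buf, "dqDQ");
    if (p != NULL)
        *p = 'e';

    v = strtod(buf, &end);
    if (end == buf || *end != '\0')     /* nothing parsed, or trailing junk */
        return 0;
    *out = v;
    return 1;
}
```

With this sketch, `parse_fortran_double("6.1D2", &v)` succeeds with `v` equal to 610.0, while a token such as `"Dec"` is rejected because nothing numeric remains after the rewrite.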

If you are willing, could you run one more check?
This entire section of code, with or without NO_FORTRAN_NUMS, is
inside a larger block which starts with the comment:

#ifdef OSK
            /* apparently %n does not work. This implementation
             * is just as good as the non-OSK one, but close
             * to a release (at last) we make it os-9 specific
             */
            int count;
            char *p = strpbrk(s, "dqDQ");
            if (p != NULL)
                *p = 'e';

            count = sscanf(s, "%lf", &df_column[df_no_cols].datum);
#else
   [Previously analysed code is here in the #else]

The question is whether this comment, which dates back at least to 1999,
is in fact correct.  Could you please compare the previous benchmarks
to the case where the code is prefixed by:   #define OSK 1  


I know this will break the checks for separators other than whitespace,
but we can sort that out afterwards. It would be nice to get a speed-up
of 10X by deleting 80+ lines of obfuscated code :-)




--
Ethan A Merritt       [hidden email]
Biomolecular Structure Center
Mailstop 357742
University of Washington, Seattle, WA 98195



Re: gnuplot slow on BIG files

Robert Hart

On Wed, 8 Jun 2005, Ethan Merritt wrote:

> If you are willing, could you run one more check?
> This entire section of code, with or without NO_FORTRAN_NUMS, is
> inside a larger block which starts with the comment:
>
> #ifdef OSK
>             /* apparently %n does not work. This implementation
>              * is just as good as the non-OSK one, but close
>              * to a release (at last) we make it os-9 specific
>              */
>             int count;
>             char *p = strpbrk(s, "dqDQ");
>             if (p != NULL)
>                 *p = 'e';
>
>             count = sscanf(s, "%lf", &df_column[df_no_cols].datum);
> #else
>    [Previously analysed code is here in the #else]
>
> The question is whether this comment, which dates back at least to 1999,
> is in fact correct.  Could you please compare the previous benchmarks
> to the case where the code is prefixed by:   #define OSK 1  

This is much much worse. (Taking nearly 5 minutes on my benchmark)

Here's why:

In the standard code path, sscanf is used to get the next float out of the
input. Then, if the *NEXT CHARACTER* is a d, D, q, or Q, that character is
replaced with an "e" and the sscanf is repeated.

In the NO_FORTRAN_NUMS code path, atof is used to get the next
float. This is much faster than sscanf.

In the OSK code path, strpbrk is used to scan *THE ENTIRE INPUT LINE* for
the first occurrence of d, D, q, or Q. This is *REPEATED* for every value
on the line. sscanf is then used to read the values, which is slow.
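
For readers without the source at hand, the default path described above can be sketched roughly as follows. This is a simplified reconstruction from the description in this message, not the actual code from datafile.c; the function name `read_field_sscanf` is made up.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative only: read one number from the start of s (which must be
 * writable) and return the number of characters consumed, or 0 if no
 * number was found. */
static int read_field_sscanf(char *s, double *out)
{
    int used = 0;

    /* %n records how many characters were consumed so far. */
    if (sscanf(s, "%lf%n", out, &used) != 1)
        return 0;

    /* A Fortran exponent letter directly after the mantissa:
     * overwrite it with 'e' and scan the field a second time. */
    if (s[used] != '\0' && strchr("dDqQ", s[used]) != NULL) {
        s[used] = 'e';
        used = 0;
        if (sscanf(s, "%lf%n", out, &used) != 1)
            return 0;
    }
    return used;
}
```

The point of the sketch is that only the single character after the number is inspected, so the cost per field does not depend on the length of the line, whereas a `strpbrk` over the whole line for every field grows with line length.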

Rob









Re: gnuplot slow on BIG files

Ethan Merritt
In reply to this post by Dimitrios Apostolou
On Wednesday 08 June 2005 03:53 pm, Robert Hart wrote:
> > The question is whether this comment, which dates back at least to 1999,
> > is in fact correct.  Could you please compare the previous benchmarks
> > to the case where the code is prefixed by:   #define OSK 1  
>
> This is much much worse. (Taking nearly 5 minutes on my benchmark)

OK. No surprise, but I thought it worth testing.

I think I'll remove that OSK chunk altogether, and make the NO_FORTRAN_NUMS
a run-time option.  Probably
        set datafile {no}fortran_floats

> In the standard code path, sscanf is used to get the next float out of the
> input. Then, if the *NEXT CHARACTER* is a d, D, q, or Q, that character is
> replaced with an "e" and the sscanf is repeated.
>
> In the NO_FORTRAN_NUMS code path, atof is used to get the next
> float. This is much faster than sscanf.

Are you saying that the speed gain could be achieved just by
replacing sscanf with atof in the standard code path, even though it still
takes the time to check for d/D/q/Q and rescan if found?

In that case, maybe we don't even need the run-time option.

        thanks,

                Ethan

--
Ethan A Merritt       [hidden email]
Biomolecular Structure Center
Mailstop 357742
University of Washington, Seattle, WA 98195



Re: gnuplot slow on BIG files

Robert Hart
On Wed, 8 Jun 2005, Ethan Merritt wrote:

> Are you saying that the speed gain could be achieved just by
> replacing sscanf with atof in the standard code path, even though it still
> takes the time to check for d/D/q/Q and rescan if found?

Not quite. atof is definitely faster than sscanf; however, atof doesn't
tell you how many characters it actually read, which (I presume) is why
sscanf is used instead. However, in both cases we manually scan for
the next separator one character at a time, so it wouldn't be hard to
'notice' if we pass a dDqQ on the way.

Another option I haven't tried/benchmarked is to use strtof. If this is as
fast as atof, then we should use it, because it'll give us a pointer to
where it finished.
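
As a rough illustration of the "pointer to where it finished" idea, the sketch below walks a line of whitespace-separated numbers. It uses `strtod` rather than `strtof`, on the assumption that the values are stored as doubles; the end-pointer mechanism is the same for both. The function name `parse_line_strtod` is hypothetical, and no benchmark is implied.

```c
#include <stdlib.h>

/* Sketch: parse up to max whitespace-separated numbers from line using
 * strtod(), which reports where each number ended through its second
 * argument. Returns how many values were stored in vals. */
static size_t parse_line_strtod(const char *line, double *vals, size_t max)
{
    size_t n = 0;
    const char *p = line;

    while (n < max) {
        char *end;
        double v = strtod(p, &end);   /* skips leading whitespace itself */

        if (end == p)                 /* nothing numeric at this position */
            break;
        vals[n++] = v;
        p = end;                      /* continue right after the number */
    }
    return n;
}
```

Parsing stops at the first field that is not a number, so a caller that needs to skip or flag such fields would have to inspect the position where the loop ended.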

Anyway, bed time here, but I think any code that can be culled from this
function has got to be a win.

Good luck


Rob






Re: gnuplot slow on BIG files

Hans-Bernhard Bröker
In reply to this post by Robert Hart
Robert Hart wrote:

> Questions: Where did this "NO_FORTRAN_NUMS" option come from,

 From discussions a while back, which I now only dimly remember.
The Fortran numbers parsing was added because people insisted they
needed it.  But it has a potential conflict with certain date/time
formats (those that have "Dec" immediately following a number, which was
being overwritten to "eec" by the attempt to parse it as a Fortran
number), and it makes things slow.
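
One conceivable way to sidestep the "Dec" clash, sketched here purely as an assumption and not as what gnuplot does, is to accept d/D/q/Q as an exponent marker only when an exponent really follows it. The helper name `looks_like_fortran_exponent` is hypothetical.

```c
#include <ctype.h>

/* Hypothetical guard, not in gnuplot: treat the character at p as a
 * Fortran exponent marker only if an exponent actually follows it,
 * i.e. an optional sign and then at least one digit. */
static int looks_like_fortran_exponent(const char *p)
{
    if (*p != 'd' && *p != 'D' && *p != 'q' && *p != 'Q')
        return 0;
    p++;
    if (*p == '+' || *p == '-')
        p++;
    return isdigit((unsigned char)*p) != 0;
}
```

Under this check the "D2" in "6.1D2" qualifies, while the "Dec" in a date field does not, so the month name would be left alone.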

> and why isn't it enabled?

Because that would break reading of datafiles that people care about.

 > Is Fortran number support useful?

Yes.

> Could it be provided in an alternative way (perhaps a run time
> option)?

Probably.



Re: gnuplot slow on BIG files

Hans-Bernhard Bröker
In reply to this post by Ethan Merritt
Ethan Merritt wrote:
> On Wednesday 08 June 2005 03:53 pm, Robert Hart wrote:

> I think I'll remove that OSK chunk altogether,

Please don't.  It's quite harmless as it is --- it'll only be used on
one platform, and you would risk breaking that platform completely by
doing this.

> and make the NO_FORTRAN_NUMS a run-time option.  Probably
> set datafile {no}fortran_floats

OK.

>>In the NO_FORTRAN_NUMS code path, atof is used to get the next
>>float. This is much faster than sscanf.

Is it?  Why would that be?  More to the point, atof has *no* error
handling capabilities, which I don't believe to be a good idea in this case.

> Are you saying that the speed gain could be achieved just by
> replacing sscanf with atof in the standard code path,

That wouldn't work at all --- atof leaves you with no way of knowing the
end of that number, no way to position to the next one, and no way to
find the 'D' in a floating point number.  sscanf() and its %n format are
used for a reason.



Re: gnuplot slow on BIG files

Ethan Merritt
On Wednesday 08 June 2005 09:47 pm, HBB wrote:

I assume Robert benchmarked atof because that's what is currently
in the source code as a compile-time alternative.  But for a real
change-over in the default code I think we would need to go with
strtod instead.

> Ethan Merritt wrote:
> >
> > I think I'll remove that OSK chunk altogether,
>
> Please don't.  It's quite harmless as it is --- it'll only be used on
> one platform, and you would risk breaking that platform completely by
> doing this.

I don't think it's likely to break anything. The problem is in the %n
format, which we can do without by using strtod instead.
But yeah, it's only a few lines of code.

> More to the point, atof has *no* error handling capabilities,
> which I don't believe to be a good idea in this case.

There is no error checking in the current code either, unless you
mean the count returned by sscanf. For true error checking we would
need strtod, unless I've overlooked some entirely different method.

> sscanf() and its %n format are used for a reason.

But %n is non-portable, as documented by the OSK comment and also
by the linux man pages.  So if Robert can benchmark the performance of
strtod as no worse than the current default code, then replacing both
paths with a strtod call may solve all these issues at one stroke.
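
For reference, the kind of checking strtod allows looks roughly like this. It is a sketch only: the enum and the function name `read_field_strtod` are invented for illustration, and nothing here is taken from the gnuplot sources.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical status codes for one field. */
enum field_status { FIELD_OK, FIELD_MISSING, FIELD_RANGE };

/* Read one number from s. On return *next points just past the text that
 * strtod() consumed (or at s itself if nothing was converted). */
static enum field_status read_field_strtod(const char *s, double *out,
                                           const char **next)
{
    char *end;

    errno = 0;                      /* strtod only sets errno on error */
    *out = strtod(s, &end);
    *next = end;

    if (end == s)
        return FIELD_MISSING;       /* no conversion was performed */
    if (errno == ERANGE)
        return FIELD_RANGE;         /* result out of range for a double */
    return FIELD_OK;
}
```

The end pointer covers both needs raised in this thread: it distinguishes "no number here" from a valid value, and it gives the position from which to look for a separator or a Fortran exponent letter.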

I'm also curious about that convoluted series of tests added by
Corey Satten and labelled "optimization".  I'll ask him if he
remembers what sort of problem case or test suite he was using.
We have many more plot options now than when this optimization was
done, and it could well be either that it has become pointless or
that it is still beneficial but has become incomplete.
Either way I'd like to know what the original rationale was.


--
Ethan A Merritt
Biomolecular Structure Center
University of Washington 98195-7742



Re: gnuplot slow on BIG files

Hans-Bernhard Bröker
Ethan Merritt wrote:
> On Wednesday 08 June 2005 09:47 pm, HBB wrote:

>>More to the point, atof has *no* error handling capabilities,
>>which I don't believe to be a good idea in this case.

> There is no error checking in the current code either, unless you
> mean the count returned by sscanf.

That's exactly the error checking I'm talking about.  Without it,
handling missing or malformed data would be impossible.

 > For true error checking we would
> need strtod, unless I've overlooked some entirely different method.

There's not a lot you could do with strtod() that sscanf() couldn't do
just as well.

>>sscanf() and its %n format are used for a reason.

> But %n is non-portable, as documented by the OSK comment

Just because some silly implementation is buggy doesn't exactly mean %n
is unportable.  %n is as portable as you can ever hope to be: it's an
ANSI requirement.

> I'm also curious about that convoluted series of tests added by
> Corey Satten and labelled "optimization".  I'll ask him if he
> remembers what sort of problem case or test suite he was using.

I dimly remember this being about not running sscanf() on all columns of
a multi-column data file if no extended using specs are in use.  If you
have "using 1:2:3", you can simply ignore columns 4 to 1000.  If you
have "using 1:2:(column(some_function($3)))", datafile.c has to convert
all 1000 of them.




Re: gnuplot slow on BIG files

Daniel J Sebald
I've looked at the current code a bit.  It's quite involved, isn't it?  For
Dimitris' sake: there seem to be some details he may not have accounted for that
would be difficult to add to the patch.  So, would it pay for him to continue
with the patch he's supplied?  Or does it look like the current code will have
to be the starting point?

Let me point out one thing here.  With the introduction of binary support, I
seem to have rewritten df_read_matrix() so that it stores the *whole* set of
data as floats.  In that way, some other features could be reused.  Now, I don't
believe this is a problem as far as time goes.  Neither should it be a memory problem
(something that would really slow down Dimitris' machine, perhaps).  The block of
memory it consumes will not be much bigger than what is eventually stored in memory.

But, what could be a problem--in light of the discussion about FORTRAN
doubles--is that instead of "floats" maybe this routine should be storing
doubles.  Certainly a plotting program doesn't need the resolution of doubles.
However, perhaps the data that is entered simply needs to have a very, very
large magnitude exponent and that is why people requested it.  (Of course, there
are ways around that, but...)

So think that over folks.  Should it be

static double *
df_read_matrix(int *rows, int *cols)
{
<snip>
     double *linearized_matrix = NULL;

etc.?  Because, if someone uses doubles and, in fact, has numbers outside the
dynamic range of a "float", it's a problem.  Do doubles make sense further down
the line after df_read_matrix?
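
To make the dynamic-range concern concrete, here is a small sketch of a range check. The helper name fits_in_float is hypothetical; only FLT_MAX and FLT_MIN from <float.h> are standard. A value such as 1e300 is an ordinary double but lies far outside what a float can hold.

```c
#include <float.h>

/* Nonzero when x can be stored in a float without overflowing and
 * without a nonzero value dropping below the smallest normal float.
 * NaN fails both comparisons and is reported as not fitting. */
static int
fits_in_float(double x)
{
    double mag = (x < 0.0) ? -x : x;

    if (mag == 0.0)
        return 1;
    return mag <= FLT_MAX && mag >= FLT_MIN;
}
```

With IEEE floats FLT_MAX is roughly 3.4e38, so data with exponents beyond about 38 would be lost if the matrix is stored as floats.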


(more below)

Hans-Bernhard Bröker wrote:

>> There is no error checking in the current code either, unless you
>> mean the count returned by sscanf.
>
>
> That's exactly the error checking I'm talking about.  Without it,
> handling missing or malformed data would be impossible.

Unless one writes a more sophisticated parser.  Given how much testing is
already done here, that would almost be as good an option.  (But perhaps not
preferable.)

>
>  > For true error checking we would
>
>> need strtod, unless I've overlooked some entirely different method.
>
>
> There's not a lot you could do with strtod() that sscanf() couldn't do
> just as well.
>
>>> sscanf() and its %n format are used for a reason.
>
>
>> But %n is non-portable, as documented by the OSK comment
>
>
> Just because some silly implementation is buggy doesn't exactly mean %n
> is unportable.  %n is as portable as you can ever hope to be: it's an
> ANSI requirement.
>
>> I'm also curious about that convoluted series of tests added by
>> Corey Satten and labelled "optimization".  I'll ask him if he
>> remembers what sort of problem case or test suite he was using.
>
>
> I dimly remember this being about not running sscanf() on all columns of
> a multi-column data file if no extended using specs are in use.  If you
> have "using 1:2:3", you can simply ignore columns 4 to 1000.  If you
> have "using 1:2:(column(some_function($3)), datafile.c has to convert
> all 1000 of them.

I don't know how deep into those tests normal execution goes each time, but
if it goes too far, it's inefficient.

There are three tests here that can be done *before* reading data and combined
into a single test within tokenise:

(fast_columns == 0)
(df_no_use_specs == 0)
df_no_use_specs > 5

and somehow my intuition tells me the following could be done in a better way:

                || ((df_no_use_specs > 0)
                    && (use_spec[0].column == dfncp1
                        || (df_no_use_specs > 1
                            && (use_spec[1].column == dfncp1
                                || (df_no_use_specs > 2
                                    && (use_spec[2].column == dfncp1
                                        || (df_no_use_specs > 3
                                            && (use_spec[3].column == dfncp1
                                                || (df_no_use_specs > 4 && (use_spec[4].column == dfncp1 || )

Going from what Hans said, is there some logic that could be done beforehand to
set up a more efficient chunk of code?
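
One possible shape for such a setup step, sketched under assumptions: the using spec is reduced to a plain array of 1-based column numbers, and SKETCH_MAX_COLS, mark_needed_columns and is_column_needed are hypothetical names, not gnuplot's. The table is built once before reading, so the per-field test becomes a single lookup, and the returned maximum tells the tokeniser where it could stop. The truncated nested condition above may test more than this, so treat it as an outline only.

```c
#include <string.h>

#define SKETCH_MAX_COLS 1024    /* hypothetical limit for this sketch */

static unsigned char column_needed[SKETCH_MAX_COLS + 1];

/* Call once before reading data: mark each column named in the using
 * spec and return the highest one.  columns[] holds 1-based numbers. */
static int
mark_needed_columns(const int *columns, int n_specs)
{
    int i, highest = 0;

    memset(column_needed, 0, sizeof column_needed);
    for (i = 0; i < n_specs; i++) {
        if (columns[i] < 1 || columns[i] > SKETCH_MAX_COLS)
            continue;           /* ignore entries outside the table */
        column_needed[columns[i]] = 1;
        if (columns[i] > highest)
            highest = columns[i];
    }
    return highest;
}

/* Per-field test inside the tokeniser: a single array lookup. */
static int
is_column_needed(int col)
{
    return col >= 1 && col <= SKETCH_MAX_COLS && column_needed[col];
}
```

Expressions in the using spec would still force conversion of every column, as Hans describes, so a real version would need a flag for that case.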

NEXT ISSUE:

There is this chunk of code:

            if (count == 1 &&
                (s[used] == 'd' || s[used] == 'D' ||
                 s[used] == 'q' || s[used] == 'Q')) {
                /* HBB 20001221: avoid breaking parsing of time/date
                 * strings like 01Dec2000 that would be caused by
                 * overwriting the 'D' with an 'e'... */
                char save_char = s[used];

                /* might be fortran double */
                s[used] = 'e';
                /* and try again */
                count = sscanf(s, "%lf", &df_column[df_no_cols].datum);
                s[used] = save_char;
            }

Isn't this a bug?  If not, why not?  Certainly the date string 01Dec2000 should
not be interpreted as a double.  But doesn't a value of 1.0 result from
scanning 01eec2000?

I would think that one would have to test for a "+-0123456789" before and after
a "dDqQ".  Then one could feel free to change that character to an 'e' and not
worry about it.
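
A minimal sketch of that test, with assumptions stated: the helper name is_fortran_exponent is hypothetical, a '.' is accepted before the marker (so that "1.D3" passes), and a sign after the marker must itself be followed by a digit.

```c
#include <ctype.h>

/* Return nonzero when s[pos] looks like a Fortran exponent marker:
 * one of d, D, q, Q, preceded by a digit or '.', and followed by a
 * digit or by a sign and a digit. */
static int
is_fortran_exponent(const char *s, int pos)
{
    const char *p = s + pos;
    char c = *p;

    if (c != 'd' && c != 'D' && c != 'q' && c != 'Q')
        return 0;
    if (pos < 1)
        return 0;               /* nothing in front of the marker */
    if (!isdigit((unsigned char)p[-1]) && p[-1] != '.')
        return 0;
    p++;
    if (*p == '+' || *p == '-')
        p++;
    return isdigit((unsigned char)*p) != 0;
}
```

With this check "1.5D+03" is accepted at position 3, while "01Dec2000" is rejected at position 2 because the 'D' is followed by 'e'. A string like "01D2000" would still be ambiguous.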

Now, I see there are some strings to contend with.  Aside from that, what if one
initially goes through the string, using string searching routines to look for
"dDqQ" and convert them?  (One needs a simple method to make sure they are not within
quotes, so one would probably have to include the double quote character in the
search somehow.)  Would it then be possible to use a much simpler routine?  Or
are we still left with using sscanf(), and that is the slow thing here?

Ethan, if

  set datafile {no}fortran_floats

is implemented, perhaps a simple test for #{dDqQ}# or #{eE}# on the first number
in the file would be useful for testing conflicts and giving a warning message.
  (I'm assuming it isn't valid to mix FORTRAN and C format in one file.)

Dan



Re: gnuplot slow on BIG files

Robert Hart
In reply to this post by Hans-Bernhard Bröker

On Thu, 2005-06-09 at 15:26 +0800, Hans-Bernhard Bröker wrote:

> That's exactly the error checking I'm talking about.  Without it,
> handling missing or malformed data would be impossible.
>
> There's not a lot you could do with strtod() that sscanf() couldn't do
> just as well.

strtod() is indeed significantly faster than sscanf.

I am testing using the following:
#ifdef USE_STRTOD
              char *fin;
              df_column[df_no_cols].datum = strtod(s, &fin);
              used = fin - s;   /* characters consumed, as %n reports */
              count = used ? 1 : 0;
#else
              count = sscanf(s, "%lf%n", &df_column[df_no_cols].datum, &used);
#endif

Can anybody come up with any cases where these two methods give
different results/error handling? strtod is ANSI C, so should be
portable.
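
One way to look for such cases is a small comparison harness. The sketch below is illustrative only: parse_strtod, parse_sscanf and parsers_agree are hypothetical names. It puts both methods behind the same (count, used) interface and reports whether they agree on a given string.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse with strtod, reporting 1 and the consumed length on success. */
static int
parse_strtod(const char *s, double *val, int *used)
{
    char *end;
    double v = strtod(s, &end);

    if (end == s) {
        *used = 0;
        return 0;
    }
    *val = v;
    *used = (int)(end - s);
    return 1;
}

/* Parse with sscanf and %n, folding 0 and EOF into a single failure. */
static int
parse_sscanf(const char *s, double *val, int *used)
{
    int n = 0;
    int count = sscanf(s, "%lf%n", val, &n);

    *used = (count == 1) ? n : 0;
    return count == 1;
}

/* Nonzero when both methods succeed with the same value and length,
 * or both fail. */
static int
parsers_agree(const char *s)
{
    double a = 0.0, b = 0.0;
    int ua = 0, ub = 0;
    int ca = parse_strtod(s, &a, &ua);
    int cb = parse_sscanf(s, &b, &ub);

    if (ca != cb)
        return 0;
    return ca == 0 || (a == b && ua == ub);
}
```

Inputs worth feeding it include a lone "-", a dangling exponent such as "1e", and the Fortran forms discussed in this thread; results may differ between C libraries, which is the point of running it.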

I've attached a "non-intrusive" patch, however if this approach is
acceptable, I think the OSK path should be removed, and the
NO_FORTRAN_NUMS code should be removed or simplified. Are there any
cases when somebody wouldn't want a fortran float interpreting properly?

Rob
 

--
Robert Hart <[hidden email]>
University of Nottingham



strtod.patch (1K) Download Attachment

Re: gnuplot slow on BIG files

Hans-Bernhard Bröker
Robert Hart wrote:

> strtod() is indeed significantly faster than sscanf.

And I still don't see why that should be the case...

> Can anybody come up with any cases where these two methods give
> different results/error handling?

Well, here's an ancient comment taken directly from datafile.c:

  /* cannot trust strtod - eg strtod("-",&p) */

 > strtod is ANSI C, so should be
> portable.

So should sscanf().

