conflict of separator and missing

Plotter-2

I have a CSV datafile which uses -99.99 as its missing data value.

With the following settings the 'missing' data are getting plotted.

  set datafile separator ","
  set datafile missing "-99.99"


  show datafile missing

         "-99.99" in datafile is interpreted as missing value


As a wild guess I tried the following and the missing data now get
correctly removed.

  set datafile missing "-99.99,"


This seems to be an illogical order of parsing.

Surely the data line needs to be parsed into its constituent data
columns before trying to detect the missing data string.

Regards, Peter


gnuplot> show version

         G N U P L O T
         Version 5.0 patchlevel 1    last modified 2015-06-07

         Copyright (C) 1986-1993, 1998, 2004, 2007-2015
         Thomas Williams, Colin Kelley and many others

         gnuplot home:     http://www.gnuplot.info
         faq, bugs, etc:   type "help FAQ"
         immediate help:   type "help"  (plot window: hit 'h')



------------------------------------------------------------------------------
What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic
patterns at an interface-level. Reveals which users, apps, and protocols are
consuming the most bandwidth. Provides multi-vendor support for NetFlow,
J-Flow, sFlow and other flows. Make informed decisions using capacity
planning reports. https://ad.doubleclick.net/ddm/clk/305295220;132659582;e
_______________________________________________
gnuplot-beta mailing list
[hidden email]
Membership management via: https://lists.sourceforge.net/lists/listinfo/gnuplot-beta

Re: conflict of separator and missing

Plotter-2

Just to complete this here is a sample line from the file displaying
this bug.

1958, 06,   21351, 1958.4548,   -99.99,   -99.99,     317.25,   315.14,     317.25,   315.14


It seems that the data is being parsed into fields using the default
space delimiter when looking for the "missing" string.  It should be
using the proper user-defined delimiter.

Peter.





Re: conflict of separator and missing

Allin Cottrell
On Tue, 14 Jun 2016, [hidden email] wrote:

> Just to complete this here is a sample line from the file displaying
> this bug.
>
> 1958, 06,   21351, 1958.4548,   -99.99,   -99.99,     317.25,   315.14,
>    317.25,   315.14

Doesn't it invite undefined behavior if you set "," as separator but
then also include spaces between the values?

Allin Cottrell


Re: conflict of separator and missing

Plotter-2
On 14/06/16 12:52, Allin Cottrell wrote:

> Doesn't it invite undefined behavior if you set "," as separator but
> then also include spaces between the values?
>
> Allin Cottrell
>

I'm not including anything; I have been given some data that I need to
plot with gnuplot.
I've always found gnuplot smart enough to deal with most things that
I've thrown at it. If this is not a bug I could always preprocess the
data to remove the commas.

The question remains as to whether this is a bug or not.

The fifth field in that line is "   -99.99" when using the comma separator.

This data is supplied with a missing VALUE of -99.99. Will this match
gnuplot's datafile missing, defined as the string "-99.99"? That depends
upon how the equality test is done in a language that has fuzzy
variable types.

But that does not account for it working with datafile missing set to
"-99.99,".

That seems to clearly indicate that there is a logical problem here. The
comma should no longer be there in the data field since it is the field
separator.

Peter.





Re: conflict of separator and missing

sfeam
On Tuesday, 14 June, 2016 13:10:13 [hidden email] wrote:

> On 14/06/16 12:52, Allin Cottrell wrote:
> > On Tue, 14 Jun 2016, [hidden email] wrote:
> >
> >> On 14/06/16 09:43, [hidden email] wrote:
> >>>
> >>> I have a data csv datafile which uses -99.99 as it missing data value.
> >>>
> >>> with the following settings the 'missing' data are getting plotted.
> >>>
> >>>   set datafile separator ","
> >>>   set datafile missing "-99.99"
> >>>
> >>>
> >>>   show datafile missing
> >>>
> >>>          "-99.99" in datafile is interpreted as missing value
> >>>

I think you are correct that there is a bug in this part of the code.
The full-length string from 'set missing' is tested against the
start of the field contents (after removing leading whitespace);
then the subsequent character is tested to see if it is whitespace
rather than a continuation of whatever string is in the field.
So it works with a tab-separated *.csv file because <tab> counts
as whitespace, but fails with a comma-separated file because the
comma is mis-interpreted as part of the field content.
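
The check described above can be modelled in a few lines. This is a minimal sketch in Python, not gnuplot's actual C source; the function name `looks_missing` and the choice to treat end-of-line like whitespace are assumptions made for illustration.

```python
def looks_missing(text_from_field_start: str, missing: str) -> bool:
    """Model of the described check: prefix match, then a whitespace test."""
    text = text_from_field_start.lstrip()      # leading whitespace is skipped
    if not text.startswith(missing):           # whole 'missing' string must match
        return False
    rest = text[len(missing):]
    # Assumption: running out of input counts as a terminator, like whitespace.
    return rest == "" or rest[0].isspace()

# The argument is the line from the start of a field onward, separator included.
print(looks_missing("-99.99\t317.25", "-99.99"))      # True  (tab-separated)
print(looks_missing("-99.99,   317.25", "-99.99"))    # False (comma follows)
print(looks_missing("-99.99,   317.25", "-99.99,"))   # True  (the "-99.99," workaround)
print(looks_missing("-99.99,317.25", "-99.99,"))      # False (no space after the comma)
```

Under this model the "-99.99," setting only appears to work because the comma in the sample data happens to be followed by spaces.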

This should be fixed.
The subsequent character should be tested for
        <next character is either whitespace or field-separator>.
Or maybe it should be
        <next character is field-separator (which might be whitespace)>.
I'm not sure which is correct.

The difference would matter in a case like this:

set datafile separator comma
set datafile missing "ignore"

plot '-' using 1:3
1, 1, 1, 1
2, 2, ignore A, 2
3, 3, ignore B, 3
4, 4, ignore, 4
5, 5, 5, 5
e

In current gnuplot lines 2 and 3 will be treated as missing
but line 4 will not.
Should a fix result in only line 4 being ignored?
Or should all three lines be ignored?
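
The two candidate rules can be compared on the five data lines above. The sketch below is a Python model, not gnuplot code; the rule names `current`, `option_a` and `option_b` and the helper `skipped_rows` are hypothetical labels, and end-of-line is assumed to terminate a field.

```python
SEP = ","
MISSING = "ignore"
ROWS = ["1, 1, 1, 1", "2, 2, ignore A, 2", "3, 3, ignore B, 3",
        "4, 4, ignore, 4", "5, 5, 5, 5"]

def tail_of_column_3(line):
    """Text from the start of column 3 to end of line, separators included."""
    return line.split(SEP, 2)[2].lstrip()

def next_char(tail):
    """Character following the 'missing' string, or None if it does not match."""
    if not tail.startswith(MISSING):
        return None
    return tail[len(MISSING):len(MISSING) + 1]   # "" at end of line

def current(tail):     # next character must be whitespace
    c = next_char(tail)
    return c is not None and (c == "" or c.isspace())

def option_a(tail):    # whitespace or field separator
    c = next_char(tail)
    return c is not None and (c == "" or c.isspace() or c == SEP)

def option_b(tail):    # field separator only
    c = next_char(tail)
    return c is not None and (c == "" or c == SEP)

def skipped_rows(rule):
    return [n for n, row in enumerate(ROWS, 1) if rule(tail_of_column_3(row))]

print("current ", skipped_rows(current))    # [2, 3]
print("option A", skipped_rows(option_a))   # [2, 3, 4]
print("option B", skipped_rows(option_b))   # [4]
```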

> >>>
> >>> As a wild guess I tried the following and the missing data now get
> >>> correctly removed.
> >>>
> >>>   set datafile missing "-99.99,"
> >>>
> >>>
> >>> This seems to be an illogical order of parsing.

That will only work if there is whitespace following the comma.
So it's not a guaranteed work-around.

        Ethan



Re: conflict of separator and missing

Plotter-2
On 14/06/16 21:00, Ethan A Merritt wrote:

> I think you are correct that there is a bug in this part of the code.
> The full-length string from 'set missing' is tested against the
> start of the field contents (after removing leading whitespace);
> then the subsequent character is tested to see if it is whitespace
> rather than a continuation of whatever string is in the field.
> So it works with a tab-separated *.csv file because <tab> counts
> as whitespace, but fails with a comma-separated file because the
> comma is mis-interpreted as part of the field content.

Thanks Ethan,

First comment: a tab-separated file is not a CSV file. The C means
comma-separated.

Your explanation seems to confirm my intuitive guess about how this was
being processed. There should never be a question of the 'missing' test
seeing the following comma since it is not the content of a field and I
think that is the origin of the bug.

I would suggest that the correct, structured way to do this is to break
into fields using the current field separator, then test whether any
fields match the missing string ( with the white-space caveats ).

If I follow your explanation, it would seem that currently the whole
line is scanned for the 'missing' string before it is split into fields,
or it is being parsed twice.

It seems logical that the line be split into fields before trying to
test the value of any field for any condition. This appears not to be
the case at the moment.


>
> This should be fixed.
> The subsequent character should be tested for
> <next character is either whitespace or field-separator>.
> Or maybe it should be
> <next character is  field-separator (which might be whitespace)>.
> I'm not sure which is correct.
>
> The difference would matter in a case like this:
>
> set datafile separator comma
> set datafile missing "ignore"
>
> plot '-' using 1:3
> 1, 1, 1, 1
> 2, 2, ignore A, 2
> 3, 3, ignore B, 3
> 4, 4, ignore, 4
> 5, 5, 5, 5
> e
>
> In current gnuplot lines 2 and 3 will be treated as missing
> but line 4 will not.
> Should a fix result in only line 4 being ignored?
> Or should all three lines be ignored?

Thanks.


2, 2, ignore A, 2
3, 3, ignore B, 3
4, 4, ignore, 4


IMO lines 2 and 3 should not match, since the field is not equal to the
'missing' string but simply contains it. Matching on that sounds like
asking for trouble. Allowing white space seems a sensible relaxation of
insisting on an exact match, since it is often added for human
readability, as is the case here.

Only something which IS the 'missing' string, or that string with leading
and/or trailing white space, should match, IMO.
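
The rule proposed here (split on the separator first, then compare each whole field after trimming white space) can be sketched as follows. This is a Python illustration only; `is_missing` and `missing_columns` are hypothetical names, the separator is assumed to be a single non-whitespace character, and quoted fields are not handled.

```python
def is_missing(field: str, missing: str) -> bool:
    """A field matches only if it IS the missing string, ignoring outer white space."""
    return field.strip() == missing

def missing_columns(line: str, sep: str, missing: str) -> list:
    """1-based column numbers whose content equals the missing string."""
    return [n for n, field in enumerate(line.split(sep), 1)
            if is_missing(field, missing)]

sample = "1958, 06,   21351, 1958.4548,   -99.99,   -99.99,     317.25,   315.14"
print(missing_columns(sample, ",", "-99.99"))             # [5, 6]
print(missing_columns("2, 2, ignore A, 2", ",", "ignore"))  # []
print(missing_columns("4, 4, ignore, 4", ",", "ignore"))    # [3]
```

Because the comparison happens after the split, the separator can never be seen as part of a field, whatever character it is.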

Peter.






Re: conflict of separator and missing

sfeam
On Tuesday, 14 June, 2016 23:00:19 [hidden email] wrote:

> Thanks Ethan,
>
> First comment: a tab separated file is not a CSV file. The C mean comma
> separated.

In practice this is not true.  Pretty much any program I know of that
supports *.csv files allows you to specify what character is used as
a field separator.

Quoting Wikipedia:

 "the term "CSV" also denotes some closely related delimiter-separated
  formats that use different field delimiters. These include tab-separated
  values and space-separated values. A delimiter that is not present in
  the field data (such as tab) keeps the format parsing simple.
  These alternate delimiter-separated files are often even given a
  .csv extension, despite the use of a non-comma field separator."

> I would suggest that the correct, structured way to do this is to break
> into fields using the current field separator, then test whether any
> fields match the missing string ( with the white-space caveats ).
>
> If I follow your explanation, it would seem that currently the whole
> line is scanned for the 'missing' string before it is split into fields,
> or it is being parsed twice.

Not quite.  The input line is scanned for field separators, the start
of each field is noted, then it goes back to process them one-by-one.
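
A rough picture of that two-step process, as a Python sketch based only on the description above and not on gnuplot's source (`field_starts` is a hypothetical helper):

```python
def field_starts(line, sep):
    """Offsets at which each field begins: 0, then one past every separator."""
    starts = [0]
    for pos, ch in enumerate(line):
        if ch == sep:
            starts.append(pos + 1)
    return starts

line = "4, 4, ignore, 4"
starts = field_starts(line, ",")
print(starts)                                  # [0, 2, 5, 13]

# Reading from a start offset alone still sees the separator and later fields:
print(repr(line[starts[2]:]))                  # ' ignore, 4'

# Bounding the field by the next start offset leaves the separator out:
print(repr(line[starts[2]:starts[3] - 1]))     # ' ignore'
```

If only the start of each field is recorded, any later test on a field has to decide for itself where the field ends, which is where the whitespace-only check goes wrong.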

>
> 2, 2, ignore A, 2
> 3, 3, ignore B, 3
> 4, 4, ignore, 4
>
>
> IMO 2 and 3 should not match since the field is not equal to the
> 'missing'  string but simply contains it. This sounds like asking for
> trouble. Allowing white space seems sensible flexibility on insisting on
> an exact match since it is often added for human readability, as is the
> case here.
>
> Only something which IS the 'missing' string or the string with leading
> and/or trailing white-space should match, IMO.

The conventional indication of missing data in a *.csv file is simply
an empty field.  This obviously is not possible in a whitespace-separated
file.  Gnuplot's use of "set missing" is outside any standard practice
I know of for csv files, so anything we choose is likely to strike
someone as wrong.

For instance, RFC 4180, the closest thing to a csv standard, states that
each field "may or may not be enclosed in double quotes".  So in the example above,
should we ignore this line?
  5, 5, "ignore", 5
This one?
  5, 5, " ignore ", 5
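
For comparison, here is how one existing CSV reader treats those two lines. The sketch uses Python's standard `csv` module with `skipinitialspace=True` so that the space after each comma is dropped before the quote is examined; it illustrates one possible answer and says nothing about what gnuplot does. The helper name `parse_rows` is made up for this example.

```python
import csv

def parse_rows(lines):
    """Parse comma-separated lines, skipping white space that follows a comma."""
    return list(csv.reader(lines, skipinitialspace=True))

rows = parse_rows(['5, 5, "ignore", 5', '5, 5, " ignore ", 5'])
print(rows[0])   # ['5', '5', 'ignore', '5']
print(rows[1])   # ['5', '5', ' ignore ', '5']
```

With this reader the quotes are removed but white space inside them is preserved, so whether the second line counts as missing still depends on whether the comparison trims the field afterwards.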


        Ethan


Re: conflict of separator and missing

Plotter-2
On 15/06/16 00:06, Ethan A Merritt wrote:

> The conventional indication of missing data in a *.csv file is simply
> an empty field.  This obviously is not possible in a whitespace-separated
> file.  Gnuplot's use of "set missing" is outside any standard practice
> I know of for csv files, so anything we choose is likely to strike
> someone as wrong.
>
> For instance, RFC-4180, the closest thing to a csv standard, states that
> "any field may be quoted with double quotes".  So in the example above,
> should we ignore this line?
>   5, 5, "ignore", 5
> This one?
>   5, 5, " ignore ", 5


OK, in the absence of any properly defined standard, where software
like Excel (probably the most common source of "CSV" files for a lot of
people) produces comma-separated values without using commas, it is
likely to be messy.

  5, 5, "ignore", 5

This seems a bit of a contrived case: what software will quote one field
in a line but not the others?

How would gnuplot cope with:

  "5","5", "ignore", "5"

Looking at the bug I reported may be a chance to review this whole messy
subject but it seems like a diversion from the clear bug case.

If gnuplot scans for the position of the field separators, it should be
stopping BEFORE it gets to the next one when testing for occurrences of
the  'missing' string.

That seems to be a simple bug that does not open a whole can of csv worms.

Peter.









Re: conflict of separator and missing

Plotter-2
In reply to this post by sfeam
On 15/06/16 00:06, Ethan A Merritt wrote:
> I think you are correct that there is a bug in this part of the code.
> The full-length string from 'set missing' is tested against the
> start of the field contents (after removing leading whitespace);
> then the subsequent character is tested to see if it is whitespace
> rather than a continuation of whatever string is in the field.
> So it works with a tab-separated *.csv file because <tab> counts
> as whitespace, but fails with a comma-separated file because the
> comma is mis-interpreted as part of the field content.
>

further thoughts:

The cause of this, then, seems to be an oversight. The end of the string
is being tested as though it were the default case of white-space
separators and not the specified separator.

It is a little irrelevant what name is given to this sort of file; the
key point is that the check you describe is not using the current
datafile separator.

Presumably the same thing would happen if someone had a file using colon
( or any other non WS char ) as separator and had correctly specified it
with


set datafile separator ":"

I have not tested this explicitly but there is nothing special about
using comma sep. so I presume the same bug would manifest.

" the subsequent character is tested to see if it is whitespace"

It seems that this test should first check for the separator and
then additionally for whitespace followed by the separator. As previously
stated, "ignore A" probably should count as a match. Substrings counting
as a match is not documented anywhere, and I see no reason for this to be
taken as a hit.
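
A minimal Python sketch of the test proposed above, not gnuplot's actual C code. It assumes a non-whitespace separator; the function name is_missing_at and its parameters are hypothetical and chosen only for illustration. The field counts as missing only when the complete missing string is followed, after optional whitespace, by the separator or the end of the line.

```python
def is_missing_at(line, pos, missing, sep):
    """Return True if the field starting at line[pos] is the 'missing' string.

    Hypothetical helper: illustrates the proposed rule only.
    """
    # Skip leading whitespace inside the field.
    while pos < len(line) and line[pos] in " \t":
        pos += 1
    # The field must start with the complete missing string.
    if not line.startswith(missing, pos):
        return False
    pos += len(missing)
    # Skip trailing whitespace, then require the separator or end of line.
    while pos < len(line) and line[pos] in " \t":
        pos += 1
    return pos == len(line) or line[pos] == sep


# The same rule applies to any separator, for example a colon.
print(is_missing_at("1,-99.99,3", 2, "-99.99", ","))
print(is_missing_at("1:2:-99.99", 4, "-99.99", ":"))
```

Because the separator is a parameter, nothing in the check is specific to commas.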

Thanks for looking into this.



Peter.



Re: conflict of separator and missing

sfeam
In reply to this post by Plotter-2
On Wednesday, 15 June 2016 08:40:46 AM [hidden email] wrote:

> On 15/06/16 00:06, Ethan A Merritt wrote:
> >> Only something which IS the 'missing' string or the string with leading
> >> and/or trailing white-space should match, IMO.
> >
> > The conventional indication of missing data in a *.csv file is simply
> > an empty field.  This obviously is not possible in a whitespace-separated
> > file.  Gnuplot's use of "set missing" is outside any standard practice
> > I know of for csv files, so anything we choose is likely to strike
> > someone as wrong.
> >
> > For instance, RFC-4180, the closest thing to a csv standard, states that
> > "any field may be quoted with double quotes".  So in the example above,
> > should we ignore this line?
> >   5, 5, "ignore", 5
> > This one?
> >   5, 5, " ignore ", 5
> >
> >
> > Ethan
> >
> >>
> >> Peter.
> >
> >
>
>
> Ok, in the absence of any properly defined standard , where software
> like Excel ( probably the most common source of "CSV" files for a lot of
> people ) produces comma separated variables without using commas, it is
> likely to be messy.
>
>   5, 5, "ignore", 5
>
> This seems a bit of a contrived case, what software will quote one field
> in a line but not the others?

Excel for one.  It depends on what "format type" you assign to the column.

> How would gnuplot cope with :
>
>   "5","5", "ignore", "5"

Gnuplot explicitly checks for both numerical and quoted numerical input
in csv files exactly because of this issue.  But the concept of checking for both
quoted and unquoted "missing" strings never occurred to me until just now.
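
One way to picture a check that accepts both quoted and unquoted "missing" strings is the Python sketch below. It is hypothetical and does not describe what gnuplot does: the helpers unquote and matches_missing are invented for illustration, and whether whitespace inside the quotes should be significant is exactly the open question, so the sketch simply keeps it significant.

```python
def unquote(field):
    """Strip surrounding whitespace and one pair of enclosing double quotes."""
    field = field.strip()
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field


def matches_missing(field, missing):
    # Compare the unquoted field contents with the missing string.
    return unquote(field) == missing


for field in ['ignore', ' "ignore"', ' " ignore "']:
    print(repr(field), matches_missing(field, "ignore"))
```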

> Looking at the bug I reported may be a chance to review this whole messy
> subject but it seems like a diversion from the clear bug case.
>
> If gnuplot scans for the position of the field separators, it should be
> stopping BEFORE it gets to the next one when testing for occurrences of
> the  'missing' string.
>
> That seems to be a simple bug that does not open a whole can of csv worms.

Yeah, but while revisiting the code it seems like a good idea to not only fix
the specific case in the bug report but also any other corner cases we can
think of.

        Ethan



Re: conflict of separator and missing

Tait
In reply to this post by Plotter-2

> > The conventional indication of missing data in a *.csv file is simply
> > an empty field.  This obviously is not possible in a whitespace-separated
> > file.  Gnuplot's use of "set missing" is outside any standard practice
> > I know of for csv files, so anything we choose is likely to strike
> > someone as wrong.

I don't follow the comment in the second sentence. Empty fields in a
TSV file* are indicated by having no data in between the field
separators. Not only is it possible, but it's quite intuitive, I
think. I'm adding spaces for clarity, but those spaces wouldn't be
in the actual file:

  header1 \t header2 \t header3 \t header4
  data1 \t data2 \t data3 \t data4
  data5 \t       \t data7 \t data8
  ...

Where data6 would be, is an empty field.

(* as an aside, TSV or "tab-separated text" is the term I always see
used for tab-separated values. I've never heard of someone refer to
a tab-separated file as "CSV".)

> > For instance, RFC-4180, the closest thing to a csv standard, states that
> > "any field may be quoted with double quotes".  So in the example above,
> > should we ignore this line?
> >   5, 5, "ignore", 5
> > This one?
> >   5, 5, " ignore ", 5
>
> Ok, in the absence of any properly defined standard , where software
> like Excel ( probably the most common source of "CSV" files for a lot of
> people ) produces comma separated variables without using commas, it is
> likely to be messy.
>
>   5, 5, "ignore", 5
>
> This seems a bit of a contrived case, what software will quote one field
> in a line but not the others?

Excel does exactly this. Fields are unquoted in general, but (only)
if they contain delimiter or quoting characters, then they are
quoted. If they contain quote characters, quotes are double-quoted.
Delimiter characters are not just "," for CSV, but also newlines.
Consider three rows of data, each containing two fields:

  row 1: ab    cd
  row 2: e\nf  g,h
  row 3: i"j   k<space>m

Excel will produce a CSV that looks like this:

  ab,cd
  "e
  f","g,h"
  "i""j",k m

This is obviously a contrived pathological case, but it's
illustrative of what common software "out there" might do.
Of course, backslash-escaping is also a common convention,
and for the same input, it might produce a CSV like:

  ab,cd
  e\
  f,g\,h
  i"j,k m

As Ethan mentioned, any convention will break some
expectations/compatibility, unless the plan is to build in
a wide range of application- or convention-specific input
filters. (Implementing such filters in Perl is usually how I
get by and produce the format gnuplot expects.)
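
As an illustration of such an input filter, here is a short Python sketch (Python rather than Perl, purely as an example) that reads the Excel-style output shown above with the standard csv module. The default dialect of that module follows the Excel quoting conventions; the backslash-escaped variant would need different reader settings and is not handled here.

```python
import csv
import io

# The Excel-style output quoted above, as one string.
text = 'ab,cd\n"e\nf","g,h"\n"i""j",k m\n'

# newline="" lets the csv module handle embedded line breaks itself.
rows = list(csv.reader(io.StringIO(text, newline="")))
for row in rows:
    print(row)
```

Each row comes back as two fields, with the embedded newline, comma and quote restored inside the field contents.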



Re: conflict of separator and missing

sfeam
On Wednesday, 15 June 2016 06:36:53 PM Tait wrote:
>
> > > The conventional indication of missing data in a *.csv file is simply
> > > an empty field.  This obviously is not possible in a whitespace-separated
> > > file.  Gnuplot's use of "set missing" is outside any standard practice
> > > I know of for csv files, so anything we choose is likely to strike
> > > someone as wrong.
>
> I don't follow the comment in the second sentence.

I was contrasting csv files to whitespace-separated files.

If you have only whitespace to separate values, then you can't simply
omit a field because this is indistinguishable from shifting all the
remaining fields over by one.  That's why we need a "missing"
placeholder.  In a csv file you shouldn't really need a "missing"
placeholder because an empty field is unambiguous.
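
The difference can be shown with a small Python sketch. Python's own string splitting stands in for a field parser here; it is an analogy, not gnuplot's parser.

```python
ws_line = "1 2  4"      # third value omitted from a space-separated line
csv_line = "1,2,,4"     # third value omitted from a comma-separated line

ws_fields = ws_line.split()       # runs of whitespace collapse
csv_fields = csv_line.split(",")  # an empty field is preserved

print(len(ws_fields), ws_fields)
print(len(csv_fields), csv_fields)
```

The whitespace-separated line yields three fields, so the 4 appears to be in column 3, while the comma-separated line keeps four fields with an empty third one.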

        Ethan




Re: conflict of separator and missing

Plotter-2
On 15/06/16 20:11, sfeam wrote:

> On Wednesday, 15 June 2016 06:36:53 PM Tait wrote:
>>
>>>> The conventional indication of missing data in a *.csv file is simply
>>>> an empty field.  This obviously is not possible in a whitespace-separated
>>>> file.  Gnuplot's use of "set missing" is outside any standard practice
>>>> I know of for csv files, so anything we choose is likely to strike
>>>> someone as wrong.
>>
>> I don't follow the comment in the second sentence.
>
> I was contrasting csv files to whitespace-separated files.
>
> If you have only whitespace to separate values, then you can't simply
> omit a field because this is indistinguishable from shifting all the
> remaining fields over by one.  That's why we need a "missing"
> placeholder.  In a csv file you shouldn't really need a "missing"
> placeholder because an empty field is unambiguous.
>
> Ethan

It should be remembered that csv files are rarely a data storage object but
simply a text dump of something else. I don't think anyone would choose
csv as a working file format. It's more a means of transmission.

So the question is what source data format the csv is a dump of.

Missing value flags like -999 etc. are commonly used in software storing
data in numerical arrays where every datum must have a finite value
assigned. An array value cannot be 'empty'. A csv dump from such
software will most likely preserve the missing value flags rather than
strip them out.

Even though a spreadsheet can have an empty cell, it is often useful to
have an affirmative flag that indicates that the cell has been processed
and determined to be a missing datum, rather than just having been
overlooked or not yet processed.

So even though a csv file can represent an empty value, this does not
mean a missing value marker is unnecessary.

Peter.








Re: conflict of separator and missing

sfeam
In reply to this post by Plotter-2
On Wednesday, 15 June 2016 11:27:27 AM [hidden email] wrote:

> On 15/06/16 00:06, Ethan A Merritt wrote:
> >> I think you are correct that there is a bug in this part of the code.
> >> > > The full-length string from 'set missing' is tested against the
> >> > > start of the field contents (after removing leading whitespace);
> >> > > then the subsequent character is tested to see if it is whitespace
> >> > > rather than a continuation of whatever string is in the field.
> >> > > So it works with a tab-separated *.csv file because <tab> counts
> >> > > as whitespace, but fails with a comma-separated file because the
> >> > > comma is mis-interpreted as part of the field content.
> >>
>
> further thoughts:
>
> The cause of this then, seems to be an oversight. The end of the string
> is being tested as though it was the default case of WSpace separators
> and not the specified separator.
>
> It is a little irrelevant what name is give to this sort of file, the
> key point is that the check you describe is not using the current
> datafile separator.
>
> Presumably the same thing would happen if someone had a file using colon
> ( or any other non WS char ) as separator and had correctly specified it
> with
>
>
> set datafile separator ":"
>
> I have not tested this explicitly but there is nothing special about
> using comma sep. so I presume the same bug would manifest.
>
> " the subsequent character is tested to see if it is whitespace"
>
> It seems that this test should be firstly a test for 'separator' and
> then additionally for white-space + separator. As previously stated
> "ignore A" probably should count as a match. Substrings counting as a
> match is not described anywhere and I see not reason for this to be
> taken as a hit.
>
> Thanks for looking into this.
> Peter.

I have made a change to datafile.c:check_missing() in CVS for both 5.0 and 5.1.
In the case of a csv file (i.e. "set datafile separator" is non-blank) it now checks
for a match of the field contents to the "missing" string and requires that the
next character is a field-terminator.

Notes:
-   Leading whitespace is ignored but trailing whitespace is not.
-   This is obviously a change, so possibly there are existing scripts that break.
-   The comparison is to a string, not a numerical value, so -99.00 ne -99.0 ne -99.
-   If the "missing" string is quoted in the data file it will not be recognized.
-   If the "missing" string itself contains quotes, the behaviour is not specified.

This change does not include an earlier suggestion to provide an option that
causes NaN (not-a-number) values to be treated as missing data.
I am inclined to add this also, probably as a new keyword  "set datafile missing NaN".
See `help missing` for detail on the current handling of missing and NaN values.
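
The rule described in this message can be sketched in Python as follows. This is an illustration of the description only, not the code in datafile.c; the name check_missing_csv and its parameters are hypothetical, and the end of the Python string stands in for the null terminator.

```python
def check_missing_csv(line, pos, missing, sep=","):
    """Sketch of the described rule; not the code in datafile.c."""
    # Leading whitespace is ignored.
    while pos < len(line) and line[pos] in " \t":
        pos += 1
    # String comparison against the field contents.
    if not line.startswith(missing, pos):
        return False
    end = pos + len(missing)
    # The next character must be the separator or the end of the string
    # (the null terminator in C). Trailing whitespace is not skipped.
    return end == len(line) or line[end] == sep


print(check_missing_csv("1, -99.99,3", 2, "-99.99"))   # leading blank
print(check_missing_csv("1,-99.99 ,3", 2, "-99.99"))   # trailing blank
print(check_missing_csv("1,-99.0,3", 2, "-99"))        # string, not number
```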

        Ethan



Re: conflict of separator and missing

Plotter-2
On 16/06/16 17:23, sfeam wrote:


> -   The comparison is to a string, not a numerical value, so -99.00 ne -99.0 ne -99

Sounds reasonable. From the gnuplot POV it is a missing *string*; if
the csv is output by a spreadsheet or other software it seems reasonable
to expect consistent string formatting (although Excel could have
different cell formats, that is probably too much to try to anticipate).


> -   Leading whitespace is ignored but trailing whitespace is not.

Seems inconsistent. Was this a programming convenience for minimal
coding changes or is there a functional logic behind this?

> ... and requires that the next character is a field-terminator.

I presume field-terminator means FS or EOL. Does this cater for WS at
end of line without an explicit FS, or does this fall foul of the previous
point?

Thanks.  Peter.


Re: conflict of separator and missing

sfeam
On Thursday, 16 June, 2016 18:00:00 [hidden email] wrote:

> On 16/06/16 17:23, sfeam wrote:
>
>
> > -   The comparison is to a string, not a numerical value, so -99.00 ne -99.0 ne -99
> Sounds reasonable. From the gnuplot POV it is a missing *string* ; if
> the cvs is output by a spreadsheet or other software it seems reasonable
> to expect consistent string formatting ( although Excel could have
> different cell formats, that is probably too much to try and anticipate.
>
>
>  >-   Leading whitespace is ignore but trailing whitespace is not.
>
> Seems inconsistent. Was this a programming convenience for minimal
> coding changes or is there a functional logic behind this?

I thought the consensus from a couple of days ago was that any difference
in the remainder of the field was significant, hence extra trailing
characters would mean that the match was imperfect.
Previously "missing A" and "missing B" were both matched as "missing".
Now they are not, even if A is a <tab> or '\n' or '\r'.

>  > ... and requires that the next character is a field-terminator.
>
> I presume field-terminator.means FS or EOL.

Separator or null.
EOL is legal within a csv field, although if you have such a file good
luck to you. When a line of data is read into gnuplot it is transferred
to a null-terminated string, so the check for null should catch the true
end-of-line.

> Does this cater for WS at end of line without an explicit FS,
> or does this fall foul of previous point?

You mean like a DOS-style file with <cr><nl> at the end of the line?
So far as I know this is properly handled by stripping away both line
termination characters on input.  But more testing wouldn't hurt.
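
A small Python sketch of that assumption: both line-termination characters are removed on input, so the last field ends at the end of the string. The helper read_data_line is hypothetical and only mimics the behaviour described here.

```python
def read_data_line(raw):
    """Drop the line terminator, DOS (<cr><nl>) or Unix (<nl>)."""
    return raw.rstrip("\r\n")


line = read_data_line("1,2,3,-999\r\n")
last_field = line.split(",")[-1]
print(repr(last_field))
```

With the terminator gone, a last field of -999 is followed directly by the end of the string.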

        Ethan





Re: conflict of separator and missing

Plotter-2
On 16/06/16 19:44, Ethan A Merritt wrote:

> On Thursday, 16 June, 2016 18:00:00 [hidden email] wrote:
>> On 16/06/16 17:23, sfeam wrote:
>>
>>
>>> -   The comparison is to a string, not a numerical value, so -99.00 ne -99.0 ne -99
>> Sounds reasonable. From the gnuplot POV it is a missing *string* ; if
>> the cvs is output by a spreadsheet or other software it seems reasonable
>> to expect consistent string formatting ( although Excel could have
>> different cell formats, that is probably too much to try and anticipate.
>>
>>
>>  >-   Leading whitespace is ignore but trailing whitespace is not.
>>
>> Seems inconsistent. Was this a programming convenience for minimal
>> coding changes or is there a functional logic behind this?
>
> I though the consensus from a couple of days ago was that any difference
> in the remainder of the field was significant, hence extra trailing
> characters would mean that the match was imperfect.
> Previously "missing A" and "missing B" were both matched as "missing".
> Now they are not, even if A is a <tab> or '\n' or '\r'.

I don't know what the consensus was but my comment on that was that
trailing WS should be stripped, as it is with leading WS.  I was
suggesting that any non-WS following the missing string meant the match
failed. Specifically relating to your  " ignore A" case.  I did not
suggest  WS"ignore"WS should fail.

If leading space is stripped, I'm not sure I see why trailing is not
also stripped.



>
>>  > ... and requires that the next character is a field-terminator.
>>
>> I presume field-terminator.means FS or EOL.
>
> Separator or null.
> EOL is legal with in a csv field, although if you have such a file good
> luck to you. When a line of data is read in to gnuplot it is transferred
> to a null-terminated string, so the check for null should catch the true
> end-of-line.
>
>> Does this cater for WS at end of line without an explicit FS,
>> or does this fall foul of previous point?
>
> You mean like a DOS-style file with <cr><nl> at the end of the line?
> So far as I know this is properly handled by stripping away both line
> termination characters on input.  But more testing wouldn't hurt.
>
> Ethan
>

No, I was not talking about CRLF end of line.

It is quite common to have 'invisible' WS after the last field and,
being the last field, probably no FS.  This is especially the case if
there was a comment: WS to provide visual separation or align comments:

1,2,3,-999             # last column data got lost in paper records !

Once the # is replaced by #0 to truncate out the comment, this line
would fall foul of your new scheme, I think.

I see no real reason not to remove the trailing WS; it seems a little
odd to strip one end and not the other.

CSV is pretty illegible at the best of times. If I need to dump a
spreadsheet to CSV I often separate with " , " to make the result a
little easier to read afterwards.

Unless I'm missing something, I don't see any reason or advantage to
not stripping trailing WS.


Not wishing to be finicky, but you seemed interested in considering any
corner cases.

Peter.





Re: conflict of separator and missing

sfeam
On Thursday, 16 June, 2016 20:46:00 [hidden email] wrote:

> On 16/06/16 19:44, Ethan A Merritt wrote:
> > On Thursday, 16 June, 2016 18:00:00 [hidden email] wrote:
> >> On 16/06/16 17:23, sfeam wrote:
> >>
> >>
> >>> -   The comparison is to a string, not a numerical value, so -99.00 ne -99.0 ne -99
> >> Sounds reasonable. From the gnuplot POV it is a missing *string* ; if
> >> the cvs is output by a spreadsheet or other software it seems reasonable
> >> to expect consistent string formatting ( although Excel could have
> >> different cell formats, that is probably too much to try and anticipate.
> >>
> >>
> >>  >-   Leading whitespace is ignored but trailing whitespace is not.
> >>
> >> Seems inconsistent. Was this a programming convenience for minimal
> >> coding changes or is there a functional logic behind this?
> >
> > I though the consensus from a couple of days ago was that any difference
> > in the remainder of the field was significant, hence extra trailing
> > characters would mean that the match was imperfect.
> > Previously "missing A" and "missing B" were both matched as "missing".
> > Now they are not, even if A is a <tab> or '\n' or '\r'.
>
> I don't know what the consensus was but my comment on that was that
> trailing WS should be stripped, as it is with leading WS.  I was
> suggesting that any non-WS following the missing string meant the match
> failed. Specifically relating to your  " ignore A" case.  I did not
> suggest  WS"ignore"WS should fail.
>
> If leading space is stripped, I'm not sure I see why trailing is not
> also stripped.
>
>
>
> >
> >>  > ... and requires that the next character is a field-terminator.
> >>
> >> I presume field-terminator.means FS or EOL.
> >
> > Separator or null.
> > EOL is legal with in a csv field, although if you have such a file good
> > luck to you. When a line of data is read in to gnuplot it is transferred
> > to a null-terminated string, so the check for null should catch the true
> > end-of-line.
> >
> >> Does this cater for WS at end of line without an explicit FS,
> >> or does this fall foul of previous point?
> >
> > You mean like a DOS-style file with <cr><nl> at the end of the line?
> > So far as I know this is properly handled by stripping away both line
> > termination characters on input.  But more testing wouldn't hurt.
> >
> > Ethan
> >
>
> No , I was not talking about CRLF end of line.
>
> It is quite common to have 'invisible' WS   after the last field and
> being the last field probably no FS.  This is especially the case if
> there was a comment :  WS to provide visual separation or align  comments:
>
> 1,2,3,-999             # last column data got lost in paper records !

A comment character is only valid at the start of a data line.
A trailing comment like the one you show will be treated as
extraneous garbage in the last field.

Prior to yesterday's change, if "missing" were set to "-999" then the
rest of the field would be ignored because of the intervening whitespace.
Since today the "missing" test will fail because "-999" is not followed
immediately by a field separator.

Do you think it should revert to terminating the "missing" check
on whitespace?  That was the example I tried to give by showing
that
  set datafile missing "missing"
would catch both fields 2 and 3 in a line containing
   1, missing A, missing B, 4
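
The earlier behaviour can be sketched in Python from the description given in this thread. This is not the original C code: the name old_check_missing is hypothetical, and treating the end of the input as a terminator is an assumption.

```python
def old_check_missing(rest_of_line, missing):
    """Sketch of the pre-change test as described in this thread."""
    s = rest_of_line.lstrip(" \t")
    # The full missing string must match the start of the field.
    if not s.startswith(missing):
        return False
    tail = s[len(missing):]
    # Assumption: end of input also terminates the match.
    return tail == "" or tail[0].isspace()


# Text starting at field 2 and at field 3 of "1, missing A, missing B, 4".
print(old_check_missing(" missing A, missing B, 4", "missing"))
print(old_check_missing(" missing B, 4", "missing"))
# The comma-separated case from the original report.
print(old_check_missing("-99.99,5", "-99.99"))
```

Terminating on whitespace is what lets both "missing A" and "missing B" match, and it is also why a value followed directly by a comma does not.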

> Once the # is replaced by #0 to truncate out the comment ,

Such replacement does not happen.

> this line  would fall foul of your new scheme I think.

Yes, it will fail.
So should I partially revert the change to restore checking for
"missing" only up to the first whitespace?

        Ethan







Re: conflict of separator and missing

Plotter-2
On 17/06/16 00:11, Ethan A Merritt wrote:

> On Thursday, 16 June, 2016 20:46:00 [hidden email] wrote:
>> On 16/06/16 19:44, Ethan A Merritt wrote:
>>> On Thursday, 16 June, 2016 18:00:00 [hidden email] wrote:
>>>> On 16/06/16 17:23, sfeam wrote:
>>>>
>>>>
>>>>> -   The comparison is to a string, not a numerical value, so -99.00 ne -99.0 ne -99
>>>> Sounds reasonable. From the gnuplot POV it is a missing *string* ; if
>>>> the cvs is output by a spreadsheet or other software it seems reasonable
>>>> to expect consistent string formatting ( although Excel could have
>>>> different cell formats, that is probably too much to try and anticipate.
>>>>
>>>>
>>>>  >-   Leading whitespace is ignored but trailing whitespace is not.
>>>>
>>>> Seems inconsistent. Was this a programming convenience for minimal
>>>> coding changes or is there a functional logic behind this?
>>>
>>> I though the consensus from a couple of days ago was that any difference
>>> in the remainder of the field was significant, hence extra trailing
>>> characters would mean that the match was imperfect.
>>> Previously "missing A" and "missing B" were both matched as "missing".
>>> Now they are not, even if A is a <tab> or '\n' or '\r'.
>>
>> I don't know what the consensus was but my comment on that was that
>> trailing WS should be stripped, as it is with leading WS.  I was
>> suggesting that any non-WS following the missing string meant the match
>> failed. Specifically relating to your  " ignore A" case.  I did not
>> suggest  WS"ignore"WS should fail.
>>
>> If leading space is stripped, I'm not sure I see why trailing is not
>> also stripped.
>>
>>
>>
>>>
>>>>  > ... and requires that the next character is a field-terminator.
>>>>
>>>> I presume field-terminator.means FS or EOL.
>>>
>>> Separator or null.
>>> EOL is legal with in a csv field, although if you have such a file good
>>> luck to you. When a line of data is read in to gnuplot it is transferred
>>> to a null-terminated string, so the check for null should catch the true
>>> end-of-line.
>>>
>>>> Does this cater for WS at end of line without an explicit FS,
>>>> or does this fall foul of previous point?
>>>
>>> You mean like a DOS-style file with <cr><nl> at the end of the line?
>>> So far as I know this is properly handled by stripping away both line
>>> termination characters on input.  But more testing wouldn't hurt.
>>>
>>> Ethan
>>>
>>
>> No , I was not talking about CRLF end of line.
>>
>> It is quite common to have 'invisible' WS   after the last field and
>> being the last field probably no FS.  This is especially the case if
>> there was a comment :  WS to provide visual separation or align  comments:
>>
>> 1,2,3,-999             # last column data got lost in paper records !
>
> A comment character is only valid at the start of a data line.
> A trailing comment like the one you show will be treated as
> extraneous garbage in the last field.
>
> Prior to yesterday's change, if "missing" were set to "-999" then the
> rest of the field it would be ignored because of the intervening whitespace.
> Since today the "missing" test will fail because "-999" is not followed
> immediately by a field separator.
>
> Do you think it should revert to terminating the "missing" check
> on whitespace?  That was the example I tried to give by showing
> that
>   set datafile missing "missing"
> would catch both fields 2 and 3 in a line containing
>    1, missing A, missing B, 4
>
>> Once the # is replaced by #0 to truncate out the comment ,
>
> Such replacement does not happen.

Oops. It seems I've been relying on "extraneous garbage in the last
field" for my comments for some time!  This may be why:

gnuplot> help comment
  Comments are supported as follows: a `#` may appear in most places in a line
  and `gnuplot` will ignore the rest of the line.  It will not have this effect
  inside quotes, inside numbers (including complex numbers), inside command
  substitutions, etc.  In short, it works anywhere it makes sense to work.

That behaviour should probably be consistent w.r.t. changes in separator.
If a `#` appearing in most places in a line gets cropped for WS-separated
files, it should work for CSV files too. The example data file line that I
gave should truncate in both formats.

Gnuplot is remarkably good at sorting out almost any file in what seems
like an intuitive way and that is something that I find quite impressive.

Is there any reason not to support comments at end of line by simply
changing # ( or whatever the commentchars are set to ) to #0 as I
incorrectly thought was being done?

Moving towards a consistent behaviour would seem preferable to reverting
because of this difference.


Can you comment on whether there is a reason to strip leading WS but not
strip trailing WS ( as is currently done )?  This seems a little odd and
is unlikely to be a combination that one would expect.

Unless there is a positive reason for this choice or a downside that I'm
missing, it would seem more consistent to strip both ends.
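
To make the suggestion concrete, here is a rough C sketch of a check that
ignores whitespace on both sides of the flag. It is not taken from the
gnuplot source: the name matches_missing_trimmed is hypothetical, and it
assumes the separator is a single non-whitespace character such as ','
(with a tab separator the trailing skip would swallow the separator itself).

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical check: skip leading whitespace, compare against the
 * "missing" string, skip trailing whitespace, then require the field
 * separator or the end of the line. */
bool matches_missing_trimmed(const char *field, const char *missing, char sep)
{
    size_t len = strlen(missing);

    while (isspace((unsigned char)*field))
        field++;
    if (strncmp(field, missing, len) != 0)
        return false;
    field += len;
    while (isspace((unsigned char)*field))
        field++;
    return *field == '\0' || *field == sep;
}
```

Under this rule "  -999   ,4" and "-999   " would both match "-999", while
"missing A" and "-999   # comment" would not, because non-whitespace
characters follow the flag.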

It seems that this choice may have been motivated by the comment issue and
maybe making the comment behaviour more consistent, as outlined above,
would neatly resolve both.

Peter.



>
>> this line  would fall foul of your new scheme I think.
>
> Yes, it will fail.
> So should I partially revert the change to restore checking for
> "missing" only up to the first whitespace?
>
> Ethan
>
>
>
>
>> I see no real reason not to remove the tailing WS , it seems a little
>> odd to strip one end an not the other.
>>
>> CSV is pretty illegible at the best of times. If need to dump a
>> spreadsheet to CSV I often separate with " , " to make the result a
>> little easier to read afterwards.
>>
>> Unless I'm missing something , I don't see any reason or advantage to
>> not stripping trailing WS.
>>
>>
>> Not wishing to be finicky, but you seemed interesting is considering any
>> corner cases.
>>
>> Peter.
>>
>>
>>
>>
>>>
>>>
>>>
>>> Ethan
>>>
>>
>>



Re: conflict of separator and missing

sfeam
On Friday, 17 June, 2016 08:42:46 [hidden email] wrote:

> Can you comment on whether there is a reason to strip leading WS but not
> strip trailing WS ( as is currently done )?  This seems a little odd and
> is unlikely to be a combination that one would expect.
>
> Unless there is a positive reason for this choice or downside that I'm
> missing, It would seem more consistent to strip both ends.

You may be over-thinking this.
Gnuplot does no "stripping" or other pre-processing of the input line.
Successive fields are read by standard calls to the C library.

For a numeric field this is either atof() or sscanf().
In either case the C language formatted input routine
1) skips over any leading whitespace,
2) parses the number, and
3) stops at the first character that is not part of the number.
That next character could be anything.
In other words, skipping any leading whitespace and ignoring any
trailing garbage is all normal behaviour for the libc input routines.
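
A small C sketch of that behaviour, using strtod() as a stand-in for
whichever libc routine is actually called. The wrapper name
parse_numeric_field is hypothetical and exists only for this illustration.

```c
#include <stdlib.h>

/* Hypothetical wrapper: parse one numeric field with strtod().
 * Leading whitespace is skipped by strtod() itself. On return, *rest
 * points at the first character that was not part of the number. */
double parse_numeric_field(const char *field, const char **rest)
{
    char *end;
    double value = strtod(field, &end);

    *rest = end;
    return value;
}
```

Given the field "   -999             # lost", this returns -999 and leaves
*rest pointing at the blank after the last digit. Everything from there on
is simply never looked at, which is why a trailing comment appears to work
for numeric columns.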

For a string field (e.g. 'plot with labels') it's a bit more complicated.
In this case yes, unquoted leading and trailing whitespace is eventually
stripped but this happens at a later stage, not while parsing the input.
Also some escape-character sequences are translated.

For detection of a "missing" flag? Well, that's what we're discussing.
This thread started with the example of using a numeric value as a
"missing" flag, which adds an additional layer of ambiguity.  
If you think of it as substituting for a number, then trailing characters
should be ignored.  If you think of it as a string, then trailing
characters including whitespace are potentially significant.  If it
really were a string then at a later stage any trailing whitespace would
be deleted but nothing in the current input layer (datafile.c) does this,
and this is the layer that has to decide whether the current record
is missing or not.

> It seems that this choice may have been motived by the comment issue and
> maybe making the comment behaviour more consistent, as outlined above,
> would neatly resolve both.

The only comment issue I see is that the documentation could be more
clear that it is talking about command lines rather than data files.
That much is easy to fix.

        Ethan
