Asp Forum - filters - comp.lang.c

lee

6/18/2011 10:15:00 AM

Hi,

it's probably a FAQ, though I haven't found any good info on it: How do
you go about writing a filter, i. e. a program that reads data from
stdin and processes it?

Since indefinite amounts of data could be read from stdin, they can't
just be put into some (ever increasing) buffer. Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.

What's the solution for this?

23 Answers

ram

6/18/2011 10:33:00 AM

lee <lee@yun.yagibdah.de> writes:
>Since indefinite amounts of data could be read from stdin, they can't
>just be put into some (ever increasing) buffer. Using a buffer of
>limited size can make it difficult to process the data because the
>buffer could be too small.
>What's the solution for this?

(There has not been a programming problem specified yet.)

Sometimes, you do not need to store all the data
when processing it. This depends on the specific problem.

A C program that cannot get enough buffer memory
to process data using realloc might terminate
setting an appropriate exit code, possibly writing
an explanation to stderr or a log file.

You could also write the data read to a file, sometimes.
This might be slower, but might allow processing larger
inputs.

When the life time of a computer is assumed to be 150 years
and one assumes that 1 Gibioctet can be read from stdin each
second, there will not be more than 5093144018288640000
octets read. So, one could simply buy that much memory
upfront before starting the program.

Ben Bacarisse

6/18/2011 10:38:00 AM

lee <lee@yun.yagibdah.de> writes:

> it's probably a FAQ, though I haven't found any good info on it: How do
> you go about writing a filter, i. e. a program that reads data from
> stdin and processes it?

This is not really a C question. I suggest asking it in
comp.programming which deals with this sort of question. I've set
follow ups there. Maybe you had a C-specific question in mind. If so,
please ignore the followup-to header and reply with the more specific
question.

It's possible you haven't found anything because the question is very
general. Pretty much the only general thing to say about a filter is
what you already know -- that it reads stdin and writes to stdout.

> Since indefinite amounts of data could be read from stdin, they can't
> just be put into some (ever increasing) buffer.

You may have no choice. A sort filter can't produce any output until it
has seen all the input while other filters can finish before having seen
all the data (for example, Unix's head command). Some can compute
functions that depend on arbitrarily long inputs without requiring
unbounded storage (for example, Unix's wc command).

> Using a buffer of
> limited size can make it difficult to process the data because the
> buffer could be too small.
>
> What's the solution for this?

You make the buffer bigger. If that is not possible you have to find
some other way to store the temporary data. Of course, it is important
to know that you really do need to store the data. As an example, you
can write a filter that prints the variance of a set of numbers without
having to store them all -- the choice of algorithm is central.
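
To make the variance example concrete, here is a minimal sketch of a filter core that keeps only three values in memory, however long the input is. No algorithm is named above; this one uses Welford's update as one possibility. The names running_stats, stats_add, stats_variance and variance_filter are made up for illustration, and the input is assumed to be whitespace-separated numbers.

```c
#include <stdio.h>

/* Running statistics: three values are kept, however long the input. */
struct running_stats {
    unsigned long n;   /* count of values seen so far */
    double mean;       /* running mean */
    double m2;         /* sum of squared deviations from the running mean */
};

/* Fold one more value into the running statistics (Welford's update). */
void stats_add(struct running_stats *s, double x)
{
    double delta = x - s->mean;

    s->n++;
    s->mean += delta / (double)s->n;
    s->m2 += delta * (x - s->mean);
}

/* Population variance of the values seen so far; 0.0 if there are none. */
double stats_variance(const struct running_stats *s)
{
    return s->n > 0 ? s->m2 / (double)s->n : 0.0;
}

/* Read numbers from `in` until end of input or the first token that is
   not a number, then print their population variance to `out`.
   Returns the number of values read. */
unsigned long variance_filter(FILE *in, FILE *out)
{
    struct running_stats s = {0, 0.0, 0.0};
    double x;

    while (fscanf(in, "%lf", &x) == 1)
        stats_add(&s, x);
    fprintf(out, "%g\n", stats_variance(&s));
    return s.n;
}
```

A main() would only need to call variance_filter(stdin, stdout). Nothing in it depends on how many numbers arrive.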

--
Ben.

Francois Grieu

6/18/2011 11:57:00 AM

On 18/06/2011 12:15, lee ask:
> How do you go about writing a filter, i. e. a program that reads
> data from stdin and processes it?

#include <stdio.h>
int main(void)
{
    int c;
    while(EOF!=(c = getchar()))
    {
        if (c>='0'&&c<='9')
            c = '9'-c+'0';
        putchar(c);
    }
}

As an aside, I wonder if
c = '9'-c+'0'
is safe, and think
c = '9'+'0'-c
is not.

Francois Grieu

James Kuyper

6/18/2011 2:15:00 PM

On 06/18/2011 06:15 AM, lee wrote:
> Hi,
>
> it's probably a FAQ, though I haven't found any good info on it: How do
> you go about writing a filter, i. e. a program that reads data from
> stdin and processes it?
>
> Since indefinite amounts of data could be read from stdin, they can't
> just be put into some (ever increasing) buffer. Using a buffer of
> limited size can make it difficult to process the data because the
> buffer could be too small.
>
> What's the solution for this?

The term "filter" is generally reserved for the kind of application that
only needs to keep a small portion of the input in memory at any given
time. A typical Unix filter is "cut". It parses each line of input up
into fields, which can be either fixed width or delimited by a
user-specifiable character which defaults to '\t'. It writes out a
specified subset of the fields, with the delimiter optionally replaced
with an arbitrary string. I've never attempted implementing it, but it
seems to me that it should be implementable in a way that never keeps
more than one character of input in memory at any given time.

However, if for some reason your program does need to store the entire
input, then you need expandable storage, and the C standard library
provides some. Start by allocating a buffer with malloc(). Whenever the
buffer gets full, call realloc() to expand it; I recommend increasing
the size by a fixed factor; 2 would be a good value. Note, there are a
couple of tricky points in connection with calling realloc():
* if realloc() fails, it returns a null pointer, and pointers into the
old buffer are still valid. Therefore, if you make the mistake of
storing the value returned by realloc() directly into the same pointer
you were using to keep track of your buffer, you'll lose your ability to
access that buffer if realloc() fails.
* if realloc() succeeds, it may have moved your data to a new location
in memory, invalidating any pointers you may have been keeping that
pointed into your old buffer. You can't safely do anything with any of
the old pointer values, not even comparing them for equality with new
ones to determine whether or not the buffer was moved. For each such
pointer, determine its offset from the beginning of the buffer before
calling realloc(). If realloc() succeeds, calculate the new value for
the corresponding pointer by adding that offset to the start of the new
buffer.
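
The two points above can be sketched as follows. This is only an illustration, assuming a byte buffer that doubles when full; the names struct buffer, buffer_push and buffer_slurp are hypothetical. The result of realloc() goes into a temporary first, and positions are held as offsets rather than pointers.

```c
#include <stdio.h>
#include <stdlib.h>

/* Growable byte buffer. Positions inside it are kept as size_t offsets
   (such as len), never as pointers, so they survive a move by realloc(). */
struct buffer {
    char *data;   /* start of the storage, or NULL while empty */
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
};

/* Append one byte, doubling the capacity when the buffer is full.
   Returns 1 on success; returns 0 and leaves the buffer untouched
   if more memory cannot be obtained. */
int buffer_push(struct buffer *b, char c)
{
    if (b->len == b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 64;
        char *tmp;

        if (new_cap <= b->cap)
            return 0;                     /* size_t overflow */
        tmp = realloc(b->data, new_cap);  /* result goes to a temporary */
        if (tmp == NULL)
            return 0;                     /* b->data is still valid */
        b->data = tmp;
        b->cap = new_cap;
    }
    b->data[b->len++] = c;
    return 1;
}

/* Append everything readable from `in`. Returns 1 at end of input,
   0 if memory ran out or a read error occurred. */
int buffer_slurp(struct buffer *b, FILE *in)
{
    int c;

    while ((c = getc(in)) != EOF)
        if (!buffer_push(b, (char)c))
            return 0;
    return !ferror(in);
}
```

Start with struct buffer b = {NULL, 0, 0}; and call buffer_slurp(&b, stdin). If it returns 0, whatever was read so far is still in b.data, so the program can report the problem or switch to another strategy before calling free(b.data).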

If realloc() does fail, you'll have to switch to a different approach.
One option is to create a temporary file using tmpfile(). Read from
standard input, then write to the temporary file. Once the entire file
is read in, you can move around in the temporary file using fseek(),
something you cannot do with stdin.
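
A rough sketch of that fallback, assuming the whole input is simply copied to the temporary file in fixed-size chunks. The function name spool_to_tmpfile and the 4096-byte chunk size are arbitrary choices for illustration.

```c
#include <stdio.h>

/* Copy all of `in` to a new temporary file and position that file at
   its start. Returns the temporary file, or NULL on failure. The file
   is removed automatically when it is closed or when the program
   terminates normally. */
FILE *spool_to_tmpfile(FILE *in)
{
    char chunk[4096];
    size_t n;
    FILE *tmp = tmpfile();

    if (tmp == NULL)
        return NULL;
    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
        if (fwrite(chunk, 1, n, tmp) != n) {
            fclose(tmp);
            return NULL;              /* write error, e.g. disk full */
        }
    }
    if (ferror(in) || fflush(tmp) != 0 || fseek(tmp, 0L, SEEK_SET) != 0) {
        fclose(tmp);
        return NULL;
    }
    return tmp;
}
```

After FILE *f = spool_to_tmpfile(stdin); succeeds, f can be read as often as needed, with fseek() used to jump around in it.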
--
James Kuyper

John Doe

6/18/2011 2:25:00 PM

On Sat, 18 Jun 2011 12:15:01 +0200, lee wrote:

> it's probably a FAQ, though I haven't found any good info on it: How do
> you go about writing a filter, i. e. a program that reads data from
> stdin and processes it?
>
> Since indefinite amounts of data could be read from stdin, they can't
> just be put into some (ever increasing) buffer. Using a buffer of
> limited size can make it difficult to process the data because the
> buffer could be too small.
>
> What's the solution for this?

Read line, process line, write line. Replace "line" with whatever unit of
data is appropriate, but for a typical Unix filter, data is processed line
by line.

Many of the original Unix tools used fixed-size buffers; if an input line
was too large, the program would just terminate with an error. This wasn't
such a problem when creating a file with lines longer than 80 characters
was a feat in itself. Nowadays, such programs are more likely to realloc()
the buffer as required.
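
As an illustration of read line, process line, write line with a buffer that grows on demand, here is a small sketch. The helper read_line and the example filter number_lines (which prefixes each line with its number) are invented for this post, not taken from any particular Unix tool.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one line of any length from `in` into *buf, growing it with
   realloc() as required. The newline, if present, is kept and the text
   is '\0'-terminated; *len receives the number of bytes read.
   Returns 1 if a line was read, 0 at end of input, -1 if out of memory. */
int read_line(FILE *in, char **buf, size_t *cap, size_t *len)
{
    int c;

    *len = 0;
    while ((c = getc(in)) != EOF) {
        if (*len + 1 >= *cap) {          /* keep room for the '\0' */
            size_t new_cap = *cap ? *cap * 2 : 128;
            char *tmp = realloc(*buf, new_cap);

            if (tmp == NULL)
                return -1;               /* *buf is still valid */
            *buf = tmp;
            *cap = new_cap;
        }
        (*buf)[(*len)++] = (char)c;
        if (c == '\n')
            break;
    }
    if (*len == 0)
        return 0;
    (*buf)[*len] = '\0';
    return 1;
}

/* Example filter: copy `in` to `out`, prefixing each line with its
   number. Returns 0 on success, -1 if memory ran out. */
int number_lines(FILE *in, FILE *out)
{
    char *line = NULL;
    size_t cap = 0, len;
    unsigned long lineno = 0;
    int status;

    while ((status = read_line(in, &line, &cap, &len)) == 1) {
        fprintf(out, "%lu\t", ++lineno);
        fwrite(line, 1, len, out);
    }
    free(line);
    return status == 0 ? 0 : -1;
}
```

The same buffer is reused for every line, so memory use is bounded by the longest line, not by the size of the input. A main() would just call number_lines(stdin, stdout).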

Bill Cunningham

6/18/2011 6:25:00 PM

Francois Grieu wrote:
> On 18/06/2011 12:15, lee ask:
>> How do you go about writing a filter, i. e. a program that reads
>> data from stdin and processes it?

I'd like to see if I can digest this code.

> #include <stdio.h>
> int main(void)
> {
> int c;
> while(EOF!=(c = getchar()))

If an input from keyboard isn't EOF

> {
> if (c>='0'&&c<='9')
> c = '9'-c+'0';

The above I can't read. But this seems to be well put together.

> putchar(c);
> }
> }
>
>
> As an aside, I wonder if
> c = '9'-c+'0'
> is safe, and think
> c = '9'+'0'-c
> is not.
>
> Francois Grieu

cri

6/18/2011 6:42:00 PM

On Sat, 18 Jun 2011 11:37:52 +0100, Ben Bacarisse
<ben.usenet@bsb.me.uk> wrote:

>lee <lee@yun.yagibdah.de> writes:
>
>> it's probably a FAQ, though I haven't found any good info on it: How do
>> you go about writing a filter, i. e. a program that reads data from
>> stdin and processes it?
>
>This is not really a C question. I suggest to ask it in
>comp.programming which deals with this sort of question. I've set
>follow ups there. Maybe you had a C-specific question in mind. If so,
>please ignore the followup-to header and reply with the more specific
>question.

In my generally excellent opinion, it is a C question rather than a
general programming question. The concept of a filter (or function)
is general; the implementation issues are language dependent. See
James's excellent little article on the C specific issues.

IMGEO (:-)) people are often too quick to say something is a
programming question rather than a C question, perhaps because they
feel that they are about the technicalities of the C standards.
Present company excepted, of course.

Angel

6/18/2011 6:52:00 PM

On 2011-06-18, Bill Cunningham <nospam@nspam.invalid> wrote:
> Francois Grieu wrote:
>
>> {
>> if (c>='0'&&c<='9')
>> c = '9'-c+'0';
>
> The above I can't read. But this seems to be well put together.

Since c holds the result of a call to getchar() and we already tested for
EOF, we know c is holding a character code, in whatever character set
the implementation is using.

This snippet of code tests if c holds the character code for a decimal
digit, and if so it "reverses" it, so that 0 becomes 9, 1 becomes 8 and
so on.

It's written in a way that will work regardless of which character set
the implementation uses, as long as the character codes for the decimal
digits form an unbroken sequence. This is true in at least ASCII (and
by extension, UTF-8) and EBCDIC, the two you are most likely to encounter.

--
"C provides a programmer with more than enough rope to hang himself.
C++ provides a firing squad, blindfold and last cigarette."
- seen in comp.lang.c

Lew Pitcher

6/18/2011 6:55:00 PM

On June 18, 2011 14:24, in comp.lang.c, nospam@nspam.invalid wrote:

> Francois Grieu wrote:
>> On 18/06/2011 12:15, lee ask:
>>> How do you go about writing a filter, i. e. a program that reads
>>> data from stdin and processes it?
>
> I'd like to see if I can digest this code.
>
>> #include <stdio.h>
>> int main(void)
>> {
>> int c;
>> while(EOF!=(c = getchar()))
>
> If an input from keyboard isn't EOF

WHILE stdin returns valid input (as opposed to an END-OF-FILE condition)

>> {
>> if (c>='0'&&c<='9')
>> c = '9'-c+'0';
>
> The above I can't read. But this seems to be well put together.

Take it in bits.

Because of the if() statement, we know that c holds a value between '0'
and '9' inclusive. In other words, c contains a numeric digit.

The expression
'9' - c
finds out how many digits there are between c and 9
If c == '0', then '9' - c results in a value of 9
If c == '1', then '9' - c results in a value of 8
If c == '2', then '9' - c results in a value of 7
and so on, to
If c == '9', then '9' - c results in a value of 0

Now, this value is then added to '0', resulting in a numeric digit
character.

If '9' - c results in a value of 9,
then '9' - c + '0' results in a value of '9'
If '9' - c results in a value of 8,
then '9' - c + '0' results in a value of '8'
and so on

The net result is that, for each digit input, the expression computes its
decimal complement
Input '0' and get '9' out
Input '1' and get '8' out
Input '2' and get '7' out
and so on, to
Input '9' and get '0' out

>
>> putchar(c);
>> }
>> }
>>
>>
>> As an aside, I wonder if
>> c = '9'-c+'0'
>> is safe, and think
>> c = '9'+'0'-c
>> is not.

Certainly,
c = '9' - c + '0';
would be safer than
c = '9' + '0' - c;
even though both are algebraically identical.

Given the conditions of its execution, the sub-expression
'9' + '0'
/might/ overflow an int, in some unknown character set, while
'9' - c
would not.

OTOH, the compiler might just optimize away the contentious subexpressions
entirely; certainly '9' + '0' is a known value which might fit within an
int. And, since the compiler need only order operations such that the
result is the same as if it had not reordered them, it is free to reorder
'9' - c + '0'
/or/
'9' + '0' - c
such that the constants are coalesced into a single value prior to object
code emission.

But,
'9' - c + '0'
is probably the safer expression when dealing with an unknown compiler.

--
Lew Pitcher
Master Codewright & JOAT-in-training | Registered Linux User #112576
Me: http://pitcher.digitalfr... | Just Linux: http://jus...
---------- Slackware - Because I know what I'm doing. ------

Keith Thompson

6/18/2011 6:56:00 PM

Angel <angel+news@spamcop.net> writes:
[...]
> It's written in a way that will work regardless of which character set
> the implementation uses, as long as the character codes for the decimal
> digits form an unbroken sequence. This is true in at least ASCII (and
> by extension, UTF-8) and EBCDIC, the two you are most likely to encounter.

And it's guaranteed by the C standard. (Letters, on the other hand, are
not guaranteed to be contiguous, and in fact they aren't in EBCDIC --
or, typically, in character sets larger than ASCII.)
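
A small sketch of what that guarantee means in practice: arithmetic on digit characters is portable, while the same trick on letters is not, so a table lookup is used for them. The function names digit_value and letter_index are made up for this example.

```c
#include <string.h>

/* Value of a decimal digit character, or -1 if c is not one.
   Portable, because the C standard guarantees that '0'..'9' are
   consecutive and increasing. */
int digit_value(int c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Zero-based position of a lower-case letter in the alphabet, or -1.
   A lookup is used because c - 'a' is not portable: letters are not
   guaranteed to be consecutive, and in EBCDIC they are not. */
int letter_index(int c)
{
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    if (c == '\0')
        return -1;               /* strchr would match the terminator */
    p = strchr(alphabet, c);
    return p ? (int)(p - alphabet) : -1;
}
```

For example, digit_value('7') is 7 on any conforming implementation, whereas 'j' - 'a' is 9 in ASCII but not in EBCDIC; letter_index('j') is 9 on both.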

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.ne...
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
