Is it possible to append to an already existing avro file

Is it possible to append to an already existing avro file

Vyacheslav Zholudev
Hi, 

is it possible to append to an already existing avro file when it was written and closed before?

If I use
outputStream = fs.append(avroFilePath);

then later on I get: java.io.IOException: Invalid sync!

Probably because the schema is written twice, among other issues.

If I use outputStream = fs.create(avroFilePath); then the avro file gets overwritten. 

Thanks,
Vyacheslav

Re: Is it possible to append to an already existing avro file

Harsh J-2
Hi,

Use the appendTo feature of the DataFileWriter. See
http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(java.io.File)

For a quick setup example, read also:
http://stackoverflow.com/questions/8806689/can-you-append-data-to-an-existing-avro-data-file

On Tue, Feb 21, 2012 at 3:15 AM, Vyacheslav Zholudev
<[hidden email]> wrote:

> Hi,
>
> is it possible to append to an already existing avro file when it was
> written and closed before?
>
> If I use
> outputStream = fs.append(avroFilePath);
>
> then later on I get: java.io.IOException: Invalid sync!
>
> Probably because the schema is written twice and some other issues.
>
> If I use outputStream = fs.create(avroFilePath); then the avro file gets
> overwritten.
>
> Thanks,
> Vyacheslav



--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about

Re: Is it possible to append to an already existing avro file

Vyacheslav Zholudev
Yep, I saw that method as well as the Stack Overflow post. However, I'm interested in how to append to a file on an arbitrary file system, not only on the local one.

I want to get an OutputStream based on the Path and the FileSystem implementation and then pass it to the Avro methods for appending.

Is that possible?

Thanks,
Vyacheslav

On Feb 21, 2012, at 5:29 AM, Harsh J wrote:

> Hi,
>
> Use the appendTo feature of the DataFileWriter. See
> http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/DataFileWriter.html#appendTo(java.io.File)
>
> For a quick setup example, read also:
> http://stackoverflow.com/questions/8806689/can-you-append-data-to-an-existing-avro-data-file

Re: Is it possible to append to an already existing avro file

Scott Carey-2

On 2/21/12 7:29 AM, "Vyacheslav Zholudev" <[hidden email]>
wrote:

>Yep, I saw that method as well as the stackoverflow post. However, I'm
>interested how to append to a file on the arbitrary file system, not only
>on the local one.
>
>I want to get an OutputStream based on the Path and the FileSystem
>implementation and then pass it for appending to avro methods.
>
>Is that possible?

It is not possible without modifying DataFileWriter. Please open a JIRA
ticket.  

It could not simply append to an OutputStream, since it must either:
* Seek to the start to validate the schemas match and find the sync
marker, or
* Trust that the schemas match and find the sync marker from the last block

DataFileWriter cannot refer to Hadoop classes such as FileSystem, but we
could add something to the mapred module that takes a Path and FileSystem
and returns something that implements an interface that DataFileWriter can
append to. This would be something that is both a SeekableInput
(http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/SeekableInput.html)
and an OutputStream, or has both an InputStream from the start of the
existing file and an OutputStream at the end.
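
A minimal sketch of the constraint described above, using a toy format rather than Avro's real file layout. It assumes a header that holds only a 16-byte sync marker, followed by blocks that each end with that marker, and it uses an in-memory buffer in place of a file. ToySyncFile and all of its methods are hypothetical names made up for this illustration. The point is that a writer needs a read handle on the start of the file to recover the marker, as well as a write handle at the end.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy container, NOT the Avro format: a 16-byte sync marker as the header,
// then blocks of [int length][payload][16-byte sync marker].
final class ToySyncFile {
    static final int SYNC_SIZE = 16;

    static byte[] newSync() {
        byte[] sync = new byte[SYNC_SIZE];
        new SecureRandom().nextBytes(sync);
        return sync;
    }

    // Starts a new file by writing a fresh marker as the header.
    static byte[] create(ByteArrayOutputStream out) {
        byte[] sync = newSync();
        out.write(sync, 0, sync.length);
        return sync;
    }

    // Writes one block, terminated by the given marker.
    static void writeBlock(ByteArrayOutputStream out, byte[] sync, String payload) {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        byte[] length = ByteBuffer.allocate(4).putInt(bytes.length).array();
        out.write(length, 0, length.length);
        out.write(bytes, 0, bytes.length);
        out.write(sync, 0, sync.length);
    }

    // Correct append: read the marker from the start of the existing data,
    // then write the new block at the end.
    static void appendTo(byte[] existing, ByteArrayOutputStream end, String payload) {
        writeBlock(end, Arrays.copyOf(existing, SYNC_SIZE), payload);
    }

    // Blind append: the writer never looks at the existing header, so it can
    // only invent a marker of its own.
    static void appendBlind(ByteArrayOutputStream end, String payload) {
        writeBlock(end, newSync(), payload);
    }

    // Reads every block and rejects any whose marker differs from the header.
    static List<String> readAll(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);
        byte[] sync = new byte[SYNC_SIZE];
        in.get(sync);
        List<String> payloads = new ArrayList<>();
        while (in.hasRemaining()) {
            byte[] payload = new byte[in.getInt()];
            in.get(payload);
            byte[] marker = new byte[SYNC_SIZE];
            in.get(marker);
            if (!Arrays.equals(sync, marker)) {
                throw new IllegalStateException("Invalid sync!");
            }
            payloads.add(new String(payload, StandardCharsets.UTF_8));
        }
        return payloads;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        byte[] sync = create(file);
        writeBlock(file, sync, "first");
        appendTo(file.toByteArray(), file, "second");
        System.out.println(readAll(file.toByteArray()));

        appendBlind(file, "third");
        try {
            readAll(file.toByteArray());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The real format also stores the schema in the header, which is why the first option above mentions validating that the schemas match. The toy version leaves that part out.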





Re: Is it possible to append to an already existing avro file

Vyacheslav Zholudev
Thanks for your reply, I suspected this.

I will create a JIRA ticket.

Vyacheslav

On Feb 21, 2012, at 6:02 PM, Scott Carey wrote:

>
> On 2/21/12 7:29 AM, "Vyacheslav Zholudev" <[hidden email]>
> wrote:
>
>> Yep, I saw that method as well as the stackoverflow post. However, I'm
>> interested how to append to a file on the arbitrary file system, not only
>> on the local one.
>>
>> I want to get an OutputStream based on the Path and the FileSystem
>> implementation and then pass it for appending to avro methods.
>>
>> Is that possible?
>
> It is not possible without modifying DataFileWriter. Please open a JIRA
> ticket.  
>
> It could not simply append to an OutputStream, since it must either:
> * Seek to the start to validate the schemas match and find the sync
> marker, or
> * Trust that the schemas match and find the sync marker from the last block
>
> DataFileWriter cannot refer to Hadoop classes such as FileSystem, but we
> could add something to the mapred module that takes a Path and FileSystem
> and returns something that implements an interface that DataFileWriter can
> append to. This would be something that is both a SeekableInput
> (http://avro.apache.org/docs/1.6.2/api/java/org/apache/avro/file/SeekableInput.html)
> and an OutputStream, or has both an InputStream from the start of the
> existing file and an OutputStream at the end.

Re: Is it possible to append to an already existing avro file

TrevniUser
In reply to this post by Harsh J-2
I was following this thread for a problem I am facing while using SortedKeyValueFiles.

Below is the piece of code that tries to obtain the appropriate writer based on whether I am appending or creating a new file:

OutputStream dataOutputStream;
if (!fileSystem.exists(dataFilePath)) {
    dataOutputStream = fileSystem.create(dataFilePath);
    mDataFileWriter = new DataFileWriter<GenericRecord>(datumWriter)
            .setSyncInterval(1 << 20)
            .create(mRecordSchema, dataOutputStream);
} else {
    dataOutputStream = fileSystem.append(dataFilePath);
    mDataFileWriter = new DataFileWriter<GenericRecord>(datumWriter)
            .setSyncInterval(1 << 20)
            .appendTo(new File(options.getPath() + DATA_FILENAME));
}

but it fails with this:

java.io.FileNotFoundException: /CHANGELOG/data (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:120)
        at org.apache.avro.file.SeekableFileInput.<init>(SeekableFileInput.java:29)
        at org.apache.avro.file.DataFileWriter.appendTo(DataFileWriter.java:149)
        at com.abc.kepler.datasink.hdfs.util.SortedKeyValueFile$Writer.<init>(SortedKeyValueFile.java:597)
        at com.abc.kepler.datasink.hdfs.util.ChangeLogUtil.getChangeLogWriter(ChangeLogUtil.java:84)
        at com.abc.kepler.datasink.hdfs.HDFSDataSinkChangeLog.append(HDFSDataSinkChangeLog.java:219)
        at com.abc.kepler.datasink.hdfs.HDFSDataSinkChangesTest.writeDataSingleEntityKeyDefaultLocation(HDFSDataSinkChangesTest.java:1036)
        at com.abc.kepler.datasink.hdfs.HDFSDataSinkChangesTest.javadocExampleTest(HDFSDataSinkChangesTest.java:645)

So, is the Avro writer not able to locate the file on HDFS? Could you please share some pointers on what could be causing this?

Re: Is it possible to append to an already existing avro file

Doug Cutting
Since the exception is thrown from java.io.FileInputStream#open, it's
trying to append to a local file, not one in HDFS.

You're passing 'new File(...)' to appendTo, when you should probably
be passing 'new FsInput(...)'.

Doug
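
A minimal sketch of the FsInput suggestion above, assuming avro-mapred and a Hadoop client are on the classpath and that the target file system supports append. HdfsAvroAppend and openForAppend are hypothetical names used only for illustration. The calls relied on are DataFileWriter.appendTo(SeekableInput, OutputStream), FsInput(Path, Configuration) and FileSystem.append(Path).

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class HdfsAvroAppend {
    // Hypothetical helper: opens a writer that appends to an existing Avro
    // data file on whichever file system the path resolves to.
    static DataFileWriter<GenericRecord> openForAppend(Path path, Configuration conf)
            throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
        // Read side: lets Avro read the existing header (schema and sync marker).
        FsInput existing = new FsInput(path, conf);
        // Write side: a stream positioned at the end of the existing file.
        OutputStream end = fs.append(path);
        return writer.appendTo(existing, end);
    }
}
```

Whether appendTo closes the read side for you appears to depend on the Avro version, so it is worth checking the javadoc for the release in use. Records are then added with writer.append(record), and the writer is closed as usual.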

On Mon, Jul 8, 2013 at 9:29 AM, TrevniUser <[hidden email]> wrote:

> I was following this thread for a problem I am facing while using
> SortedKeyValueFiles.
>
> Below is the piece of code that tries to obtain the appropriate writer based
> on whether I am appending or creating a new file:
>
> OutputStream dataOutputStream;
> if (!fileSystem.exists(dataFilePath)) {
>     dataOutputStream = fileSystem.create(dataFilePath);
>     mDataFileWriter = new DataFileWriter<GenericRecord>(datumWriter)
>             .setSyncInterval(1 << 20)
>             .create(mRecordSchema, dataOutputStream);
> } else {
>     dataOutputStream = fileSystem.append(dataFilePath);
>     mDataFileWriter = new DataFileWriter<GenericRecord>(datumWriter)
>             .setSyncInterval(1 << 20)
>             .appendTo(new File(options.getPath() + DATA_FILENAME));
> }
>
> but it fails with this:
>
> java.io.FileNotFoundException: /CHANGELOG/data (No such file or directory)
>         at java.io.FileInputStream.open(Native Method)
>         at java.io.FileInputStream.<init>(FileInputStream.java:120)
>         at org.apache.avro.file.SeekableFileInput.<init>(SeekableFileInput.java:29)
>         at org.apache.avro.file.DataFileWriter.appendTo(DataFileWriter.java:149)
>         at com.abc.kepler.datasink.hdfs.util.SortedKeyValueFile$Writer.<init>(SortedKeyValueFile.java:597)
>         at com.abc.kepler.datasink.hdfs.util.ChangeLogUtil.getChangeLogWriter(ChangeLogUtil.java:84)
>         at com.abc.kepler.datasink.hdfs.HDFSDataSinkChangeLog.append(HDFSDataSinkChangeLog.java:219)
>         at com.abc.kepler.datasink.hdfs.HDFSDataSinkChangesTest.writeDataSingleEntityKeyDefaultLocation(HDFSDataSinkChangesTest.java:1036)
>         at com.abc.kepler.datasink.hdfs.HDFSDataSinkChangesTest.javadocExampleTest(HDFSDataSinkChangesTest.java:645)
>
> So, is the Avro writer not able to locate the file on HDFS? Could you
> please share some pointers on what could be causing this?

Re: Is it possible to append to an already existing avro file

TrevniUser
Thanks for replying. You are correct. I followed this example: https://gist.github.com/QwertyManiac/4724582