split large xml files
on 10.08.2007 10:33:26 by Bob Bedford
Hi all,
I have an XML file that takes longer than my host's execution time limit
to be read by a PHP script.
What I'd like to do is split the large XML file (can be more than 30MB) into
smaller parts and keep the header in every file.
Here is the idea:
....
The only thing that varies is the number of "info" entries. What I'd like is
to split the file into smaller ones with the same header data but each
with fewer <info> tags (say, limited to 3 per file).
Is there any simple way? This will only be done if the file is bigger
than 1MB.
Bob
Re: split large xml files
on 10.08.2007 11:21:34 by unknown
Post removed (X-No-Archive: yes)
Re: split large xml files
on 10.08.2007 11:34:21 by Bob Bedford
> $xml = simplexml_load_file($xmlFile);
> And take it from there. Have a quick read of the simplexml docs. You
> should have your solution in very little time.
>
Thanks for replying...
After a quick search, I have to say I'm still on PHP 4! Damn!
Re: split large xml files
on 10.08.2007 11:53:48 by p.lepin
Bob Bedford wrote in
<46bc22e5$0$3808$5402220f@news.sunrise.ch>:
> What I'd like to do is split the large XML file (can be
> more than 30MB) in little parts and keep the header for
> every file.
>
> ...
>
> The only change is the amount of "info" available. What
> I'd like is to split the file to create little ones with
> the same header data but each with fewer <info>
> tags (say limited to 3 for every file).
<?php
error_reporting (E_ALL | E_STRICT) ;
define ('MAX_ITEMS' , 3) ; $infoList = array () ;
$doc = new DOMDocument () ; $doc->load ('split.xml') ;
// keep a deep copy of every <info> element
foreach
(
  $infos = $doc->getElementsByTagName ('info') as $info
)
  $infoList [] = $info->cloneNode (TRUE) ;
// remove the originals (last to first, since the node
// list is live), leaving only the shared skeleton in $doc
for ($i = $infos->length - 1 ; $i >= 0 ; -- $i)
  $infos->item ($i)->parentNode->removeChild
    ($infos->item ($i)) ;
// write the skeleton plus up to MAX_ITEMS copies per file
for ($i = 1 ; count ($infoList) ; ++ $i)
{
  $curDoc = new DOMDocument () ;
  $curDoc->appendChild
  (
    $curDoc->importNode ($doc->documentElement , TRUE)
  ) ;
  for ($j = MAX_ITEMS ; $j && $info ; -- $j)
    if (! ($info = array_shift ($infoList))) break ;
    else
      $curDoc->documentElement->appendChild
        ($curDoc->importNode ($info , TRUE)) ;
  $curDoc->save ('split_' . $i . '.xml') ;
}
?>
--
"Patience is a minor form of despair, disguised as
virtue." -- Ambrose Bierce
Re: split large xml files
on 10.08.2007 13:24:34 by Bob Bedford
Hmm, what more can I say than thank you!
I'll implement it... thanks
Re: split large xml files
on 10.08.2007 13:32:18 by gosha bine
On 10.08.2007 11:21 David Gillen wrote:
> Bob Bedford said:
>> Hi all,
>>
>> I've an XML file that takes more than the hosting time limit to be readed by
>> a PHP script.
>>
>> What I'd like to do is split the large XML file (can be more than 30MB) in
>> little parts and keep the header for every file.
>>
>> Here is the idea:
>>
>> ...
>>
>> The only change is the amount of "info" available. What I'd like is to split
>> the file to create little ones with the same header data but each
>> with fewer <info> tags (say limited to 3 for every file).
>>
>> It's there any simple way ? This will only be done if the file is bigger
>> than 1MB
>>
> $xml = simplexml_load_file($xmlFile);
> And take it from there. Have a quick read of the simplexml docs. You should
> have your solution in very little time.
>
Didn't test it, but I doubt SimpleXML would be able to load a 30MB XML
file. I think the OP's best option is to use a tool that can read and
parse in small chunks, like expat (see
http://www.php.net/manual/en/function.xml-parse.php)
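
For illustration, a minimal sketch of the chunked approach, assuming
the elements of interest are named "info" as in the original post. The
function names, the global counter and the 8 KB chunk size are
made up for this example; only the xml_* calls come from PHP's expat
extension. It merely counts the elements, but the same handlers could
just as well write them out to separate files.

```php
<?php
// Hypothetical counter shared with the handler below.
$infoCount = 0;

// Called by expat for every opening tag it encounters.
function startElement($parser, $name, $attrs)
{
    global $infoCount;
    if ($name == 'info') {
        $infoCount++;
    }
}

// Nothing to do on closing tags in this sketch.
function endElement($parser, $name)
{
}

// Feed the file to expat 8 KB at a time, so that only one chunk
// is held in memory at any moment. Returns the number of <info>
// elements, or false if the file cannot be opened or parsed.
function countInfoElements($path)
{
    global $infoCount;
    $infoCount = 0;

    $parser = xml_parser_create();
    // Report element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler($parser, 'startElement', 'endElement');

    $fp = fopen($path, 'rb');
    if (!$fp) {
        return false;
    }
    while (!feof($fp)) {
        $chunk = fread($fp, 8192);
        // The third argument tells expat whether this is the last chunk.
        if (!xml_parse($parser, $chunk, feof($fp))) {
            fclose($fp);
            return false;
        }
    }
    fclose($fp);
    return $infoCount;
}
```

Usage would be something like countInfoElements('split.xml'). Since it
sticks to plain functions and a global, this style should also suit
older PHP versions.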
--
gosha bine
makrell ~ http://www.tagarga.com/blok/makrell
php done right ;) http://code.google.com/p/pihipi
Re: split large xml files
on 10.08.2007 17:12:58 by Michael Fesser
..oO(Pavel Lepin)
>
> error_reporting (E_ALL | E_STRICT) ;
> define ('MAX_ITEMS' , 3) ; $infoList = array () ;
> $doc = new DOMDocument () ; $doc->load ('split.xml') ;
> foreach
> (
> $infos = $doc->getElementsByTagName ('info') as $info
> )
> $infoList [] = $info->cloneNode (TRUE) ;
>[...]
Uh ... that's a very unconventional and confusing coding style with all
these blanks and unexpected line breaks hanging around ...
Micha
Re: split large xml files
on 10.08.2007 20:23:05 by Cristian Cotovan
On Aug 10, 2:34 am, "Bob Bedford" wrote:
> > $xml = simplexml_load_file($xmlFile);
> > And take it from there. Have a quick read of the simplexml docs. You
> > should have your solution in very little time.
>
> Thanks for replying....
> after a quick search, I've to say I'm still in PHP 4 !!! damn !!!
If you have files that big, SimpleXML is not an option: it reads the
whole file into memory and makes a copy of it, so the memory will run
out. What you really want is XML parsing in "streaming" or "pull
parsing" mode. You can read about it here:
http://www.ibm.com/developerworks/xml/library/x-pullparsingphp.html?ca=dgr-lnxw06XMLReader
However, I guess this is also not very helpful since you're running
PHP 4 and XMLReader was introduced in PHP 5. I am fighting this at
the moment too (with no solution yet), as I have to parse huge ONIX
files from book publishers (some are 90 MB!). Let me know if you get
lucky.
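
For anyone who does have PHP 5, a rough sketch of what pull parsing
could look like for the splitting problem in this thread. It assumes
the <info> elements and the header elements are all direct children
of the root element, and that the root carries no attributes or
namespaces worth keeping. splitByInfo() and writePart() are
hypothetical names; XMLReader and its methods are the real extension.

```php
<?php
// Write one output file: the root element wrapping the header
// markup followed by the current batch of <info> elements.
function writePart($prefix, $index, $rootName, $header, $items)
{
    $xml = '<?xml version="1.0"?>' . "\n"
         . '<' . $rootName . '>' . $header . implode('', $items)
         . '</' . $rootName . '>' . "\n";
    file_put_contents($prefix . $index . '.xml', $xml);
}

// Walk the children of the root one at a time and start a new
// file after every $perFile <info> elements. Returns the number
// of files written.
function splitByInfo($path, $perFile, $prefix)
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        return 0;
    }

    // Advance to the root element.
    $more = $reader->read();
    while ($more && $reader->nodeType !== XMLReader::ELEMENT) {
        $more = $reader->read();
    }
    if (!$more) {
        $reader->close();
        return 0;
    }
    $rootName = $reader->name;

    $header = '';
    $batch = array();
    $files = 0;

    // Step into the root, then hop from sibling to sibling.
    $more = $reader->read();
    while ($more) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            if ($reader->name === 'info') {
                $batch[] = $reader->readOuterXml();
                if (count($batch) === $perFile) {
                    writePart($prefix, ++$files, $rootName, $header, $batch);
                    $batch = array();
                }
            } else {
                // Anything that is not <info> is treated as header.
                $header .= $reader->readOuterXml();
            }
        }
        // next() skips the subtree that was just handled.
        $more = $reader->next();
    }

    // Flush the last, possibly shorter, batch.
    if ($batch) {
        writePart($prefix, ++$files, $rootName, $header, $batch);
    }
    $reader->close();
    return $files;
}
```

Calling splitByInfo('split.xml', 3, 'split_') would then write
split_1.xml, split_2.xml and so on. Only the header and one batch of
<info> elements are held in memory at a time, which is the point of
pull parsing.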
Re: split large xml files
on 13.08.2007 10:24:20 by p.lepin
Michael Fesser wrote:
> .oO(Pavel Lepin)
>
>>
>> error_reporting (E_ALL | E_STRICT) ;
>> define ('MAX_ITEMS' , 3) ; $infoList = array () ;
>> $doc = new DOMDocument () ; $doc->load ('split.xml') ;
>> foreach
>> (
>> $infos = $doc->getElementsByTagName ('info') as $info
>> )
>> $infoList [] = $info->cloneNode (TRUE) ;
>>[...]
>
> Uh ... that's a very unconventional and confusing coding
> style with all these blanks and unexpected line breaks
> hanging around ...
And your point is..?
Re: split large xml files
on 14.08.2007 15:42:24 by Michael Fesser
..oO(Pavel Lepin)
>And your point is..?
Exactly what I said. The posted code doesn't follow any coding
guidelines and is _very_ hard to read and understand.
Micha
Re: split large xml files
on 14.08.2007 16:38:01 by p.lepin
Michael Fesser wrote in
<45c3c3l43v5mf2uq6r88tsmb8m93pdbik9@4ax.com>:
> .oO(Pavel Lepin)
>>And your point is..?
>
> Exactly what I said. The posted code doesn't follow any
> coding guidelines
The code I posted follows the PHP coding style guidelines
(the variant for short code snippets in our dev dept's CMS)
of the organisation I'm working for. I don't think I should
snap out of my habits (that weren't all that easy to
develop to boot, since the coding style I personally prefer
uses *way* more whitespace than the snippet in my OP) just
for the sake of your ease of understanding. Not only are
you not signing my paychecks, other people might actually
find the code easier to read in the style I used, so
there's no reason to give you any preference.
> and is _very_ hard to read and understand.
I find the coding style promoted by Zend IDE ugly and hard
to parse even with syntax highlighting, let alone with the
naked eye. It's a matter of perception, and if you believe
there's any sort of consensus on preferable coding style
even in PHP community alone, you're sadly mistaken.
Re: split large xml files
on 14.08.2007 17:06:21 by Michael Fesser
..oO(Pavel Lepin)
>The code I posted follows the PHP coding style guidelines
>(the variant for short code snippets in our dev dept's CMS)
>of the organisation I'm working for.
I've seen many coding guidelines (for PHP, C/C++, Java, Pascal etc.),
but it's the first time I came across something like you posted. Just
two examples:
foreach
(
...
)
vs.
for (...)
Where's the logic in that? Sometimes a line break in a control
structure, sometimes not? Same here:
$curDoc->appendChild
(
$curDoc->importNode ($doc->documentElement , TRUE)
) ;
vs.
$curDoc->documentElement->appendChild
($curDoc->importNode ($info , TRUE)) ;
Illogical (IMHO).
>other people might actually
>find the code easier to read in the style I used
I really doubt that, but YMMV. For example it's quite common to _not_
put a blank between a function name and its arguments, simply because it
can be confused with a control structure or a property in OOP. If you
like that - OK. But you should also think about other coders that might
read your code. Especially about inexperienced coders who are asking for
help in a newsgroup.
>It's a matter of perception, and if you believe
>there's any sort of consensus on preferable coding style
>even in PHP community alone, you're sadly mistaken.
Of course there's not the one and only coding style (and never will be),
but there are some very basic rules, which are a part of most if not all
guidelines. Call it common sense.
Micha
Re: split large xml files
on 15.08.2007 08:34:03 by p.lepin
Michael Fesser wrote:
> .oO(Pavel Lepin)
>
>>The code I posted follows the PHP coding style guidelines
>>(the variant for short code snippets in our dev dept's
>>CMS) of the organisation I'm working for.
>
> I've seen many coding guidelines (for PHP, C/C++, Java,
> Pascal etc.), but it's the first time I came across
> something like you posted. Just two examples:
>
> foreach
> (
> ...
> )
>
> vs.
>
> for (...)
>
> Where's the logic in that?
I don't think it would be wise of me to answer this. It
would quickly get religious.
> Sometimes a line break in a control structure, sometimes
> not? Same here:
>
> $curDoc->appendChild
> (
> $curDoc->importNode ($doc->documentElement , TRUE)
> ) ;
>
> vs.
>
> $curDoc->documentElement->appendChild
> ($curDoc->importNode ($info , TRUE)) ;
>
> Illogical (IMHO).
The coding standard I'm referring to sets the maximum line
length to 78 chars; for posting to usenet I use 60 chars,
but the guideline remains the same--the second option is
used when the argument list or conditional fits one line.
Otherwise, the first option is used. Again, I'll leave the
reasoning out of this to avoid inciting YAHW.
>>other people might actually find the code easier to read
>>in the style I used
>
> I really doubt that, but YMMV. For example it's quite
> common to _not_ put a blank between a function name and
> its arguments, simply because it can be confused with a
> control structure or a property in OOP. If you like that -
> OK. But you should also think about other coders that
> might read your code. Especially about inexperienced
> coders who are asking for help in a newsgroup.
I think it's actually beneficial to neophytes to learn that
there are many coding style guidelines in the world, all of
them in conflict with each other; and that they might have
to adapt quickly, especially if they ever have to work on
several unrelated projects in a consulting capacity. YMMV,
indeed.
>>It's a matter of perception, and if you believe
>>there's any sort of consensus on preferable coding style
>>even in PHP community alone, you're sadly mistaken.
>
> Of course there's not the one and only coding style (and
> never will be), but there are some very basic rules, which
> are a part of most if not all guidelines.
Yes, there are some, but any two people are never going to
agree *which ones* are.
> Call it common sense.
Oh, please. Don't 'should be obvious' me on something which
is so much a matter of taste and habit.