[Bioc-sig-seq] Getting file names from list.files in a more useful order
Michael Muratet
mmuratet at hudsonalpha.org
Wed Oct 7 23:05:59 CEST 2009
Greetings
I am working on adapting readIntensities from ShortRead to handle the
new Illumina intensity file format, *.cif. Illumina has dropped the
leading zeros from the tile number in the file name. With the old-style
names, list.files returns:
list.files(pattern="int.txt.p.gz")
[1] "s_1_0001_int.txt.p.gz" "s_1_0002_int.txt.p.gz"
"s_1_0003_int.txt.p.gz" "s_1_0004_int.txt.p.gz" "s_1_0005_int.txt.p.gz"
[6] "s_1_0006_int.txt.p.gz" "s_1_0007_int.txt.p.gz"
"s_1_0008_int.txt.p.gz" "s_1_0009_int.txt.p.gz" "s_1_0010_int.txt.p.gz"
[11] "s_1_0011_int.txt.p.gz" "s_1_0012_int.txt.p.gz"
"s_1_0013_int.txt.p.gz" "s_1_0014_int.txt.p.gz" "s_1_0015_int.txt.p.gz"
[16] "s_1_0016_int.txt.p.gz" "s_1_0017_int.txt.p.gz"
"s_1_0018_int.txt.p.gz" "s_1_0019_int.txt.p.gz" "s_1_0020_int.txt.p.gz"
which puts everything in the order that one would like to read. I
believe this is because the tile numbers are zero-padded to a fixed
width, so the lexical sort matches the numeric order of the tiles.
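To make the padding point concrete, here is a small sketch. The tile
numbers are made up for illustration, and sort(method = "radix") is
used only to get a locale-independent character sort; it is not claimed
to be exactly what list.files does internally.

```r
# Illustrative tile numbers (not taken from a real run)
tiles <- c(1L, 2L, 10L, 11L, 100L)

# Old-style names: tile number zero-padded to four digits
padded <- sprintf("s_1_%04d_int.txt.p.gz", tiles)
# New-style names: no padding
unpadded <- sprintf("s_1_%d.cif", tiles)

# Character sort of each set of names
padded_sorted <- sort(padded, method = "radix")
unpadded_sorted <- sort(unpadded, method = "radix")

# TRUE when the character sort preserves numeric tile order
padded_in_tile_order <- identical(padded_sorted, padded)
unpadded_in_tile_order <- identical(unpadded_sorted, unpadded)
```

With fixed-width padding the two orders agree; without it, "10" and
"100" sort ahead of "2".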
The new scheme yields:
list.files(pattern="cif")
[1] "s_1_1.cif" "s_1_10.cif" "s_1_100.cif" "s_1_101.cif"
"s_1_102.cif" "s_1_103.cif" "s_1_104.cif" "s_1_105.cif" "s_1_106.cif"
[10] "s_1_107.cif" "s_1_108.cif" "s_1_109.cif" "s_1_11.cif"
"s_1_110.cif" "s_1_111.cif" "s_1_112.cif" "s_1_113.cif" "s_1_114.cif"
[19] "s_1_115.cif" "s_1_116.cif" "s_1_117.cif" "s_1_118.cif"
"s_1_119.cif" "s_1_12.cif" "s_1_120.cif" "s_1_13.cif" "s_1_14.cif"
which complicates building the requisite data structures because it's
not in tile order.
The new convention is further complicated by the fact that the
intensity files are now arranged in sub-folders by cycle and lane.
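The same ordering problem would apply to the cycle folders if their
names carry unpadded numbers. A rough sketch follows, assuming a
hypothetical layout of L001/C<cycle>.1 built in a temporary directory;
the folder naming is an assumption for illustration, not something
stated above.

```r
# Build a throwaway tree: <root>/L001/C1.1, C2.1, C10.1 (assumed layout)
root <- tempfile("intensities")
lane_dir <- file.path(root, "L001")
for (cycle in c(1L, 2L, 10L)) {
  dir.create(file.path(lane_dir, sprintf("C%d.1", cycle)), recursive = TRUE)
}

# list.files returns the cycle folders in alphabetical order
cycle_dirs <- list.files(lane_dir, pattern = "^C[0-9]+")

# Extract the cycle number and reorder the folders numerically
cycle_num <- as.integer(sub("^C([0-9]+)\\..*$", "\\1", cycle_dirs))
cycle_dirs_ordered <- cycle_dirs[order(cycle_num)]

unlink(root, recursive = TRUE)  # clean up the temporary tree
```

A lane number could be extracted the same way and passed to order() as
an additional key.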
I could buffer everything until it's all read and then organize it
appropriately, but it seems like it would be much simpler if I could get
the vector into tile order instead of lexical order. I don't see a
command or other simple way to do this, but I'm hoping someone will be
able to offer a suggestion. Anybody have any ideas?
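One possible approach, sketched here rather than taken from ShortRead:
pull the tile number out of each file name with a regular expression
and reorder the vector with order(). The helper name tile_order is
hypothetical, and the pattern assumes names of the form
s_<lane>_<tile>.cif as in the listing above.

```r
# Hypothetical helper: reorder .cif file names by numeric tile number
tile_order <- function(files) {
  # The tile number is the run of digits just before ".cif"
  tile <- as.integer(sub("^.*_([0-9]+)\\.cif$", "\\1", basename(files)))
  files[order(tile)]
}

# Names in the lexical order that list.files would return
files <- c("s_1_1.cif", "s_1_10.cif", "s_1_100.cif", "s_1_11.cif", "s_1_2.cif")
files_by_tile <- tile_order(files)
```

Here files_by_tile lists tiles 1, 2, 10, 11, 100 in that order. In real
use the input would be the result of list.files(pattern = "\\.cif$").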
Thanks
Mike