[Bioc-sig-seq] Getting file names from list.files in a more useful order
Martin Morgan
mtmorgan at fhcrc.org
Thu Oct 8 05:08:56 CEST 2009
Hi Michael --
Michael Muratet wrote:
> Greetings
>
> I am working on adapting readIntensities from ShortRead to handle the
> new Illumina intensity file format, *.cif. Illumina has dropped the
> leading zeros from the file name so that if you use list.files to get
> file names from the old style you get:
>
> list.files(pattern="int.txt.p.gz")
> [1] "s_1_0001_int.txt.p.gz" "s_1_0002_int.txt.p.gz"
> "s_1_0003_int.txt.p.gz" "s_1_0004_int.txt.p.gz" "s_1_0005_int.txt.p.gz"
> [6] "s_1_0006_int.txt.p.gz" "s_1_0007_int.txt.p.gz"
> "s_1_0008_int.txt.p.gz" "s_1_0009_int.txt.p.gz" "s_1_0010_int.txt.p.gz"
> [11] "s_1_0011_int.txt.p.gz" "s_1_0012_int.txt.p.gz"
> "s_1_0013_int.txt.p.gz" "s_1_0014_int.txt.p.gz" "s_1_0015_int.txt.p.gz"
> [16] "s_1_0016_int.txt.p.gz" "s_1_0017_int.txt.p.gz"
> "s_1_0018_int.txt.p.gz" "s_1_0019_int.txt.p.gz" "s_1_0020_int.txt.p.gz"
>
> which puts everything in the order that one would like to read. I
> believe this is because the lexical sorting matches the arithmetic order
> of the tiles.
>
> The new scheme yields:
>
> list.files(pattern="cif")
> [1] "s_1_1.cif" "s_1_10.cif" "s_1_100.cif" "s_1_101.cif"
> "s_1_102.cif" "s_1_103.cif" "s_1_104.cif" "s_1_105.cif" "s_1_106.cif"
> [10] "s_1_107.cif" "s_1_108.cif" "s_1_109.cif" "s_1_11.cif"
> "s_1_110.cif" "s_1_111.cif" "s_1_112.cif" "s_1_113.cif" "s_1_114.cif"
> [19] "s_1_115.cif" "s_1_116.cif" "s_1_117.cif" "s_1_118.cif"
> "s_1_119.cif" "s_1_12.cif" "s_1_120.cif" "s_1_13.cif" "s_1_14.cif"
You could extract the lane and tile information along the lines of
files = c("s_1_1.cif", "s_1_10.cif")
lanes = as.integer(sub("^s_([[:digit:]]+)_.*", "\\1", files))
tiles = as.integer(sub(".*_([[:digit:]]+)\\.cif$", "\\1", files))
and then order the files with
files[order(lanes, tiles)]
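
A slightly more defensive version of the same idea can be wrapped in a function. This is only a sketch: the name orderCifFiles is made up for illustration (it is not part of ShortRead), and it assumes every file name has the form s_<lane>_<tile>.cif, stopping if one does not.

```r
# Hypothetical helper (not part of ShortRead): return 'files' sorted
# numerically by lane and then by tile, assuming names of the form
# s_<lane>_<tile>.cif.
orderCifFiles <- function(files) {
    base <- basename(files)  # ignore any leading directory components
    pattern <- "^s_([[:digit:]]+)_([[:digit:]]+)\\.cif$"
    ok <- grepl(pattern, base)
    if (!all(ok))
        stop("unexpected file name(s): ",
             paste(files[!ok], collapse = ", "))
    lanes <- as.integer(sub(pattern, "\\1", base))
    tiles <- as.integer(sub(pattern, "\\2", base))
    files[order(lanes, tiles)]
}

files <- c("s_1_1.cif", "s_1_10.cif", "s_1_100.cif", "s_1_11.cif",
           "s_1_2.cif")
orderCifFiles(files)
# [1] "s_1_1.cif"   "s_1_2.cif"   "s_1_10.cif"  "s_1_11.cif"  "s_1_100.cif"
```

Because the pattern is matched against basename(files), the same function should also work on paths returned by list.files(full.names=TRUE).
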
In earlier versions, I think the file name was actually configurable in
the pipeline software and recorded in the XML configuration files, though
few people seemed to change it.
> which complicates building the requisite data structures because it's
> not in tile order.
>
> The new convention is further complicated by the fact that the intensity
> files are now arranged in sub-folders by cycle and lane.
>
> I could buffer everything until it's all read and then organize it
> appropriately, but it seems like it would be much simpler if I could get
> the vector into tile order instead of lexical order. I don't see a
> command or other simple way to do this, but I'm hoping someone will be
> able to offer a suggestion. Anybody have any ideas?
>
> Thanks
>
> Mike
>
>
>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793