- using R Under development (unstable) (2025-12-18 r89199)
- using platform: x86_64-pc-linux-gnu
- R was compiled by
gcc-15 (Debian 15.2.0-11) 15.2.0
GNU Fortran (Debian 15.2.0-11) 15.2.0
- running under: Debian GNU/Linux forky/sid
- using session charset: UTF-8
- checking for file ‘nc/DESCRIPTION’ ... OK
- this is package ‘nc’ version ‘2025.3.24’
- checking CRAN incoming feasibility ... [0s/1s] OK
- checking package namespace information ... OK
- checking package dependencies ... OK
- checking if this is a source package ... OK
- checking if there is a namespace ... OK
- checking for executable files ... OK
- checking for hidden files and directories ... OK
- checking for portable file names ... OK
- checking for sufficient/correct file permissions ... OK
- checking serialization versions ... OK
- checking whether package ‘nc’ can be installed ... OK
See the install log for details.
- checking package directory ... OK
- checking for future file timestamps ... OK
- checking ‘build’ directory ... OK
- checking DESCRIPTION meta-information ... OK
- checking top-level files ... OK
- checking for left-over files ... OK
- checking index information ... OK
- checking package subdirectories ... OK
- checking code files for non-ASCII characters ... OK
- checking R files for syntax errors ... OK
- checking whether the package can be loaded ... [0s/0s] OK
- checking whether the package can be loaded with stated dependencies ... [0s/0s] OK
- checking whether the package can be unloaded cleanly ... [0s/0s] OK
- checking whether the namespace can be loaded with stated dependencies ... [0s/0s] OK
- checking whether the namespace can be unloaded cleanly ... [0s/0s] OK
- checking loading without being on the library search path ... [0s/1s] OK
- checking use of S3 registration ... OK
- checking dependencies in R code ... OK
- checking S3 generic/method consistency ... OK
- checking replacement functions ... OK
- checking foreign function calls ... OK
- checking R code for possible problems ... [4s/5s] OK
- checking Rd files ... [0s/1s] OK
- checking Rd metadata ... OK
- checking Rd line widths ... OK
- checking Rd cross-references ... OK
- checking for missing documentation entries ... OK
- checking for code/documentation mismatches ... OK
- checking Rd \usage sections ... OK
- checking Rd contents ... OK
- checking for unstated dependencies in examples ... OK
- checking installed files from ‘inst/doc’ ... OK
- checking files in ‘vignettes’ ... OK
- checking examples ... [1s/2s] ERROR
Running examples in ‘nc-Ex.R’ failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: capture_all_str
> ### Title: Capture all matches in a single subject string
> ### Aliases: capture_all_str
>
> ### ** Examples
>
>
> data.table::setDTthreads(1)
>
> chr.pos.vec <- c(
+ "chr10:213,054,000-213,055,000",
+ "chrM:111,000-222,000",
+ "this will not match",
+ NA, # neither will this.
+ "chr1:110-111 chr2:220-222") # two possible matches.
> keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
> ## By default elements of subject are treated as separate lines (and
> ## NAs are removed). Named arguments are used to create capture
> ## groups, and conversion functions such as keep.digits are used to
> ## convert the previously named group.
> int.pattern <- list("[0-9,]+", keep.digits)
> (match.dt <- nc::capture_all_str(
+ chr.pos.vec,
+ chrom="chr.*?",
+ ":",
+ chromStart=int.pattern,
+ "-",
+ chromEnd=int.pattern))
chrom chromStart chromEnd
<char> <int> <int>
1: chr10 213054000 213055000
2: chrM 111000 222000
3: chr1 110 111
4: chr2 220 222
> str(match.dt)
Classes ‘data.table’ and 'data.frame': 4 obs. of 3 variables:
$ chrom : chr "chr10" "chrM" "chr1" "chr2"
$ chromStart: int 213054000 111000 110 220
$ chromEnd : int 213055000 222000 111 222
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
>
> ## Extract all fields from each alignment block, using two regex
> ## patterns, then dcast.
> info.txt.gz <- system.file(
+ "extdata", "SweeD_Info.txt.gz", package="nc")
> info.vec <- readLines(info.txt.gz)
> info.vec[24:40]
[1] " Alignment 1" ""
[3] "\t\tChromosome:\t\tscaffold_0" "\t\tSequences:\t\t14"
[5] "\t\tSites:\t\t\t1670366" "\t\tDiscarded sites:\t1264068"
[7] "" "\t\tProcessing:\t\t155.53 seconds"
[9] "" "\t\tPosition:\t\t8.936200e+07"
[11] "\t\tLikelihood:\t\t4.105582e+02" "\t\tAlpha:\t\t\t6.616326e-06"
[13] "" ""
[15] " Alignment 2" ""
[17] "\t\tChromosome:\t\tscaffold_1"
> info.dt <- nc::capture_all_str(
+ sub("Alignment ", "//", info.vec),
+ "//",
+ alignment="[0-9]+",
+ fields="[^/]+")
> (fields.dt <- info.dt[, nc::capture_all_str(
+ fields,
+ "\t+",
+ variable="[^:]+",
+ ":\t*",
+ value=".*"),
+ by=alignment])
alignment variable value
<char> <char> <char>
1: 1 Chromosome scaffold_0
2: 1 Sequences 14
3: 1 Sites 1670366
4: 1 Discarded sites 1264068
5: 1 Processing 155.53 seconds
6: 1 Position 8.936200e+07
7: 1 Likelihood 4.105582e+02
8: 1 Alpha 6.616326e-06
9: 2 Chromosome scaffold_1
10: 2 Sequences 14
11: 2 Sites 1447008
12: 2 Discarded sites 1093595
13: 2 Processing 138.83 seconds
14: 2 Position 8.722482e+07
15: 2 Likelihood 2.531514e+02
16: 2 Alpha 1.031963e-05
17: 3 Chromosome scaffold_2
18: 3 Sequences 14
19: 3 Sites 1379975
20: 3 Discarded sites 1043204
21: 3 Processing 134.50 seconds
22: 3 Position 8.461182e+07
23: 3 Likelihood 2.945708e+02
24: 3 Alpha 8.684652e-06
25: 4 Chromosome scaffold_3
26: 4 Sequences 14
27: 4 Sites 1293978
28: 4 Discarded sites 988465
29: 4 Processing 120.76 seconds
30: 4 Position 4.182126e+07
31: 4 Likelihood 6.110444e+02
32: 4 Alpha 3.335514e-06
33: 5 Chromosome scaffold_4
34: 5 Sequences 14
35: 5 Sites 1319920
36: 5 Discarded sites 1011446
37: 5 Processing 126.99 seconds
38: 5 Position 6.978721e+07
39: 5 Likelihood 2.884914e+02
40: 5 Alpha 1.062780e-05
41: 6 Chromosome scaffold_5
42: 6 Sequences 14
43: 6 Sites 1295460
44: 6 Discarded sites 990655
45: 6 Processing 119.64 seconds
46: 6 Position 8.837822e+07
47: 6 Likelihood 3.304343e+02
48: 6 Alpha 7.572795e-06
49: 7 Chromosome scaffold_6
50: 7 Sequences 14
51: 7 Sites 1197964
52: 7 Discarded sites 908454
53: 7 Processing 115.17 seconds
54: 7 Position 3.444713e+07
55: 7 Likelihood 3.261829e+02
56: 7 Alpha 3.427719e-06
57: 8 Chromosome scaffold_7
58: 8 Sequences 14
59: 8 Sites 1315248
60: 8 Discarded sites 998530
61: 8 Processing 125.20 seconds
62: 8 Position 2.337819e+07
63: 8 Likelihood 4.023517e+02
64: 8 Alpha 5.350802e-06
65: 9 Chromosome scaffold_8
66: 9 Sequences 14
67: 9 Sites 1110658
68: 9 Discarded sites 845039
69: 9 Processing 109.15 seconds
70: 9 Position 8.152571e+07
71: 9 Likelihood 3.114815e+02
72: 9 Alpha 3.899136e-06
73: 10 Chromosome scaffold_9
74: 10 Sequences 14
75: 10 Sites 1091036
76: 10 Discarded sites 833765
77: 10 Processing 104.91 seconds
78: 10 Position 2.669453e+07
79: 10 Likelihood 1.829336e+02
80: 10 Alpha 8.380941e-06
alignment variable value
> (fields.wide <- data.table::dcast(fields.dt, alignment ~ variable))
Key: <alignment>
alignment Alpha Chromosome Discarded sites Likelihood Position
<char> <char> <char> <char> <char> <char>
1: 1 6.616326e-06 scaffold_0 1264068 4.105582e+02 8.936200e+07
2: 10 8.380941e-06 scaffold_9 833765 1.829336e+02 2.669453e+07
3: 2 1.031963e-05 scaffold_1 1093595 2.531514e+02 8.722482e+07
4: 3 8.684652e-06 scaffold_2 1043204 2.945708e+02 8.461182e+07
5: 4 3.335514e-06 scaffold_3 988465 6.110444e+02 4.182126e+07
6: 5 1.062780e-05 scaffold_4 1011446 2.884914e+02 6.978721e+07
7: 6 7.572795e-06 scaffold_5 990655 3.304343e+02 8.837822e+07
8: 7 3.427719e-06 scaffold_6 908454 3.261829e+02 3.444713e+07
9: 8 5.350802e-06 scaffold_7 998530 4.023517e+02 2.337819e+07
10: 9 3.899136e-06 scaffold_8 845039 3.114815e+02 8.152571e+07
Processing Sequences Sites
<char> <char> <char>
1: 155.53 seconds 14 1670366
2: 104.91 seconds 14 1091036
3: 138.83 seconds 14 1447008
4: 134.50 seconds 14 1379975
5: 120.76 seconds 14 1293978
6: 126.99 seconds 14 1319920
7: 119.64 seconds 14 1295460
8: 115.17 seconds 14 1197964
9: 125.20 seconds 14 1315248
10: 109.15 seconds 14 1110658
>
> ## Capture all csv tables in report -- the file name can be given as
> ## the subject to nc::capture_all_str, which calls readLines to get
> ## data to parse.
> (report.txt.gz <- system.file(
+ "extdata", "SweeD_Report.txt.gz", package="nc"))
[1] "/home/hornik/tmp/R.check/r-devel-gcc/Work/build/Packages/nc/extdata/SweeD_Report.txt.gz"
> (report.dt <- nc::capture_all_str(
+ report.txt.gz,
+ "//",
+ alignment="[0-9]+",
+ "\n",
+ csv="[^/]+"
+ )[, {
+ data.table::fread(text=csv)
+ }, by=alignment])
alignment Position Likelihood Alpha
<char> <num> <num> <num>
1: 1 700.0 4.637328e-03 2.763840e+02
2: 1 130585.6 3.781283e-01 8.490200e-04
3: 1 260471.2 3.602315e-02 4.691340e-03
4: 1 390356.9 7.618749e-01 5.377668e-04
5: 1 520242.5 2.979971e-08 1.411765e-01
---
9996: 10 82991564.8 8.051006e-03 1.357819e-03
9997: 10 83074967.8 7.048433e-03 1.825764e-03
9998: 10 83158370.8 1.012360e-07 7.999999e-03
9999: 10 83241773.8 3.977189e-08 9.999997e-01
10000: 10 83325174.0 3.980538e-08 1.200000e+03
>
> ## Join report with info fields.
> report.dt[fields.wide, on=.(alignment)]
alignment Position Likelihood Alpha i.Alpha Chromosome
<char> <num> <num> <num> <char> <char>
1: 1 700.0 4.637328e-03 2.763840e+02 6.616326e-06 scaffold_0
2: 1 130585.6 3.781283e-01 8.490200e-04 6.616326e-06 scaffold_0
3: 1 260471.2 3.602315e-02 4.691340e-03 6.616326e-06 scaffold_0
4: 1 390356.9 7.618749e-01 5.377668e-04 6.616326e-06 scaffold_0
5: 1 520242.5 2.979971e-08 1.411765e-01 6.616326e-06 scaffold_0
---
9996: 9 85297670.3 1.078915e-01 1.730811e-02 3.899136e-06 scaffold_8
9997: 9 85383396.6 2.282976e-02 2.002634e-02 3.899136e-06 scaffold_8
9998: 9 85469122.8 1.573487e+00 1.169200e-03 3.899136e-06 scaffold_8
9999: 9 85554849.1 6.892966e-02 5.344763e-03 3.899136e-06 scaffold_8
10000: 9 85640578.0 0.000000e+00 1.200000e+03 3.899136e-06 scaffold_8
Discarded sites i.Likelihood i.Position Processing Sequences
<char> <char> <char> <char> <char>
1: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14
2: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14
3: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14
4: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14
5: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14
---
9996: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14
9997: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14
9998: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14
9999: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14
10000: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14
Sites
<char>
1: 1670366
2: 1670366
3: 1670366
4: 1670366
5: 1670366
---
9996: 1110658
9997: 1110658
9998: 1110658
9999: 1110658
10000: 1110658
>
> ## Parse an nbib citation file.
> (pmc.nbib <- system.file(
+ "extdata", "PMC3045577.nbib", package="nc"))
[1] "/home/hornik/tmp/R.check/r-devel-gcc/Work/build/Packages/nc/extdata/PMC3045577.nbib"
> blank <- "\n "
> pmc.dt <- nc::capture_all_str(
+ pmc.nbib,
+ Abbreviation="[A-Z]+",
+ " *- ",
+ value=list(
+ ".*",
+ list(blank, ".*"), "*"),
+ function(x)sub(blank, "", x))
> str(pmc.dt)
Classes ‘data.table’ and 'data.frame': 50 obs. of 2 variables:
$ Abbreviation: chr "PMID" "OWN" "STAT" "DCOM" ...
$ value : chr "21113027" "NLM" "MEDLINE" "20110512" ...
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
>
> ## What do the variable fields mean? It is explained on
> ## https://www.nlm.nih.gov/bsd/mms/medlineelements.html which has a
> ## local copy in this package (downloaded 18 Sep 2019).
> fields.html <- system.file(
+ "extdata", "MEDLINE_Fields.html", package="nc")
> if(interactive())browseURL(fields.html)
> fields.vec <- readLines(fields.html)
>
> ## It is pretty easy to capture fields and abbreviations if gsub is
> ## used to remove some tags first.
> no.strong <- gsub("</?strong>", "", fields.vec)
> no.comments <- gsub("<!--.*?-->", "", no.strong)
> ## grep then capture_first_vec can be used if each desired row in
> ## the output comes from a single line of the input file.
> (h3.vec <- grep("<h3", no.comments, value=TRUE))
[1] "<h3><a id=\"ab\" name=\"ab\"></a>Abstract (AB)</h3>"
[2] "<h3><a id=\"ci\" name=\"ci\"></a>Copyright Information (CI)</h3>"
[3] "<h3><a id=\"ad\" name=\"ad\"></a>Affiliation (AD)</h3>"
[4] "<h3><a id=\"irad\" name=\"irad\"></a>Investigator Affiliation (IRAD)</h3>"
[5] "<h3><a id=\"aid\" name=\"aid\"></a>Article Identifier (AID)</h3>"
[6] "<h3><a id=\"au\" name=\"au\"></a>Author (AU)</h3>"
[7] "<h3><a id=\"auid\" name=\"auid\"></a>Author Identifier (AUID)</h3>"
[8] "<h3><a id=\"fau\" name=\"fau\"></a>Full Author (FAU)</h3>"
[9] "<h3><a id=\"cc2\" name=\"bti\"></a>Book Title (BTI)</h3>"
[10] "<h3><a id=\"cc4\" name=\"cti\"></a>Collection Title (CTI)</h3>"
[11] "<h3><a id=\"cc\" name=\"cc\"></a>Comments/Corrections (See fields and field tags listed below.)</h3>"
[12] "<h3><a id=\"coi\" name=\"coi\"></a>Conflict of Interest Statement (COIS)</h3>"
[13] "<h3><a id=\"cn\" name=\"cn\"></a>Corporate Author (CN)</h3>"
[14] "<h3><a id=\"dcom2\" name=\"crdt\"></a>Create Date (CRDT)</h3>"
[15] "<h3><a id=\"dcom\" name=\"dcom\"></a>Date Completed (DCOM)</h3>"
[16] "<h3><a id=\"da\" name=\"da\"></a>Date Created (DA)</h3>"
[17] "<h3><a id=\"lr\" name=\"lr\"></a>Date Last Revised (LR)</h3>"
[18] "<h3><a id=\"dep\" name=\"dep\"></a>Date of Electronic Publication (DEP)</h3>"
[19] "<h3><a id=\"dp\" name=\"dp\"></a>Date of Publication (DP)</h3>"
[20] "<h3><a id=\"edat2\" name=\"ed\"></a>Editor (ED) and Full Editor Name (FED)</h3>"
[21] "<h3><a id=\"edat3\" name=\"en\"></a>Edition (EN)</h3>"
[22] "<h3><a id=\"edat\" name=\"edat\"></a>Entrez Date (EDAT)</h3>"
[23] "<h3><a id=\"gs\" name=\"gs\"></a>Gene Symbol (GS): not currently input</h3>"
[24] "<h3><a id=\"gn\" name=\"gn\"></a>General Note (GN)</h3>"
[25] "<h3><a id=\"gr\" name=\"gr\"></a>Grant Number (GR)</h3>"
[26] "<h3><a id=\"ir\" name=\"ir\"></a>Investigator Name (IR) and Full Investigator Name (FIR)</h3>"
[27] "<h3><a id=\"is2\" name=\"isbn\"></a>ISBN (ISBN)</h3>"
[28] "<h3><a id=\"is\" name=\"is\"></a>ISSN (IS)</h3>"
[29] "<h3><a id=\"ip\" name=\"ip\"></a>Issue (IP)</h3>"
[30] "<h3><a id=\"ta\" name=\"ta\"></a>Journal Title Abbreviation (TA)</h3>"
[31] "<h3><a id=\"jt\" name=\"jt\"></a>Journal Title (JT)</h3>"
[32] "<h3><a id=\"la\" name=\"la\"></a>Language (LA)</h3>"
[33] "<h3><a id=\"la3\" name=\"lid\"></a>Location Identifier (LID)</h3>"
[34] "<h3><a id=\"la2\" name=\"mid\"></a>Manuscript Identifier (MID)</h3>"
[35] "<h3><a id=\"mhda\" name=\"mhda\"></a>MeSH Date (MHDA)</h3>"
[36] "<h3><a id=\"mh\" name=\"mh\"></a>MeSH Terms (MH)</h3>"
[37] "<h3><a id=\"jid\" name=\"jid\"></a>NLM Unique ID (JID)</h3>"
[38] "<h3><a id=\"rf\" name=\"rf\"></a>Number of References (RF)</h3>"
[39] "<h3><a id=\"oab\" name=\"oab\"></a>Other Abstract (OAB)</h3>"
[40] "<h3><a id=\"oci\" name=\"oci\"></a>Other Copyright Information (OCI)</h3>"
[41] "<h3><a id=\"oid\" name=\"oid\"></a>Other ID (OID)</h3>"
[42] "<h3><a id=\"ot\" name=\"ot\"></a>Other Term (OT)</h3>"
[43] "<h3><a id=\"oto\" name=\"oto\"></a>Other Term Owner (OTO)</h3>"
[44] "<h3><a id=\"own\" name=\"own\"></a>Owner (OWN)</h3>"
[45] "<h3><a id=\"pg\" name=\"pg\"></a>Pagination (PG)</h3>"
[46] "<h3><a id=\"ps\" name=\"ps\"></a>Personal Name as Subject (PS)</h3>"
[47] "<h3><a id=\"fps\" name=\"fps\"></a>Full Personal Name as Subject (FPS)</h3>"
[48] "<h3><a id=\"pl\" name=\"pl\"></a>Place of Publication (PL)</h3>"
[49] "<h3><a id=\"phst\" name=\"phst\"></a>Publication History Status (PHST)</h3>"
[50] "<h3><a id=\"pst\" name=\"pst\"></a>Publication Status (PST)</h3>"
[51] "<h3><a id=\"pt\" name=\"pt\"></a>Publication Type (PT)</h3>"
[52] "<h3><a id=\"pubm\" name=\"pubm\"></a>Publishing Model (PUBM)</h3>"
[53] "<h3><a id=\"pmid2\" name=\"pmc\"></a>PubMed Central Identifer (PMC)</h3>"
[54] "<h3><a id=\"pmid3\" name=\"pmcr\"></a>PubMed Central Release (PMCR)</h3>"
[55] "<h3><a id=\"pmid\" name=\"pmid\"></a>PubMed Unique Identifier (PMID)</h3>"
[56] "<h3><a id=\"rn\" name=\"rn\"></a>Registry Number/EC Number (RN)</h3>"
[57] "<h3><a id=\"nm\" name=\"nm\"></a>Substance Name (NM)</h3>"
[58] "<h3><a id=\"si\" name=\"si\"></a>Secondary Source ID (SI)</h3>"
[59] "<h3><a id=\"so\" name=\"so\"></a>Source (SO)</h3>"
[60] "<h3><a id=\"sfm\" name=\"sfm\"></a>Space Flight Mission (SFM)</h3>"
[61] "<h3><a id=\"stat\" name=\"stat\"></a>Status (STAT)</h3>"
[62] "<h3><a id=\"sb\" name=\"sb\"></a>Subset (SB)</h3>"
[63] "<h3><a id=\"ti\" name=\"ti\"></a>Title (TI)</h3>"
[64] "<h3><a id=\"tt\" name=\"tt\"></a>Transliterated Title (TT)</h3>"
[65] "<h3><a id=\"vi\" name=\"vi\"></a>Volume (VI)</h3>"
[66] "<h3><a id=\"cc3\" name=\"vti\"></a>Volume Title (VTI)</h3>"
> h3.pattern <- list(
+ nc::field("name", '="', '[^"]+'),
+ '"></a>',
+ fields.abbrevs="[^<]+")
> first.fields.dt <- nc::capture_first_vec(
+ h3.vec, h3.pattern)
> field.abbrev.pattern <- list(
+ Field=".*?",
+ " \\(",
+ Abbreviation="[^)]+",
+ "\\)",
+ "(?: and |$)?")
> (first.each.field <- first.fields.dt[, nc::capture_all_str(
+ fields.abbrevs, field.abbrev.pattern),
+ by=fields.abbrevs])
fields.abbrevs
<char>
1: Abstract (AB)
2: Copyright Information (CI)
3: Affiliation (AD)
4: Investigator Affiliation (IRAD)
5: Article Identifier (AID)
6: Author (AU)
7: Author Identifier (AUID)
8: Full Author (FAU)
9: Book Title (BTI)
10: Collection Title (CTI)
11: Comments/Corrections (See fields and field tags listed below.)
12: Conflict of Interest Statement (COIS)
13: Corporate Author (CN)
14: Create Date (CRDT)
15: Date Completed (DCOM)
16: Date Created (DA)
17: Date Last Revised (LR)
18: Date of Electronic Publication (DEP)
19: Date of Publication (DP)
20: Editor (ED) and Full Editor Name (FED)
21: Editor (ED) and Full Editor Name (FED)
22: Edition (EN)
23: Entrez Date (EDAT)
24: Gene Symbol (GS): not currently input
25: General Note (GN)
26: Grant Number (GR)
27: Investigator Name (IR) and Full Investigator Name (FIR)
28: Investigator Name (IR) and Full Investigator Name (FIR)
29: ISBN (ISBN)
30: ISSN (IS)
31: Issue (IP)
32: Journal Title Abbreviation (TA)
33: Journal Title (JT)
34: Language (LA)
35: Location Identifier (LID)
36: Manuscript Identifier (MID)
37: MeSH Date (MHDA)
38: MeSH Terms (MH)
39: NLM Unique ID (JID)
40: Number of References (RF)
41: Other Abstract (OAB)
42: Other Copyright Information (OCI)
43: Other ID (OID)
44: Other Term (OT)
45: Other Term Owner (OTO)
46: Owner (OWN)
47: Pagination (PG)
48: Personal Name as Subject (PS)
49: Full Personal Name as Subject (FPS)
50: Place of Publication (PL)
51: Publication History Status (PHST)
52: Publication Status (PST)
53: Publication Type (PT)
54: Publishing Model (PUBM)
55: PubMed Central Identifer (PMC)
56: PubMed Central Release (PMCR)
57: PubMed Unique Identifier (PMID)
58: Registry Number/EC Number (RN)
59: Substance Name (NM)
60: Secondary Source ID (SI)
61: Source (SO)
62: Space Flight Mission (SFM)
63: Status (STAT)
64: Subset (SB)
65: Title (TI)
66: Transliterated Title (TT)
67: Volume (VI)
68: Volume Title (VTI)
fields.abbrevs
Field Abbreviation
<char> <char>
1: Abstract AB
2: Copyright Information CI
3: Affiliation AD
4: Investigator Affiliation IRAD
5: Article Identifier AID
6: Author AU
7: Author Identifier AUID
8: Full Author FAU
9: Book Title BTI
10: Collection Title CTI
11: Comments/Corrections See fields and field tags listed below.
12: Conflict of Interest Statement COIS
13: Corporate Author CN
14: Create Date CRDT
15: Date Completed DCOM
16: Date Created DA
17: Date Last Revised LR
18: Date of Electronic Publication DEP
19: Date of Publication DP
20: Editor ED
21: Full Editor Name FED
22: Edition EN
23: Entrez Date EDAT
24: Gene Symbol GS
25: General Note GN
26: Grant Number GR
27: Investigator Name IR
28: Full Investigator Name FIR
29: ISBN ISBN
30: ISSN IS
31: Issue IP
32: Journal Title Abbreviation TA
33: Journal Title JT
34: Language LA
35: Location Identifier LID
36: Manuscript Identifier MID
37: MeSH Date MHDA
38: MeSH Terms MH
39: NLM Unique ID JID
40: Number of References RF
41: Other Abstract OAB
42: Other Copyright Information OCI
43: Other ID OID
44: Other Term OT
45: Other Term Owner OTO
46: Owner OWN
47: Pagination PG
48: Personal Name as Subject PS
49: Full Personal Name as Subject FPS
50: Place of Publication PL
51: Publication History Status PHST
52: Publication Status PST
53: Publication Type PT
54: Publishing Model PUBM
55: PubMed Central Identifer PMC
56: PubMed Central Release PMCR
57: PubMed Unique Identifier PMID
58: Registry Number/EC Number RN
59: Substance Name NM
60: Secondary Source ID SI
61: Source SO
62: Space Flight Mission SFM
63: Status STAT
64: Subset SB
65: Title TI
66: Transliterated Title TT
67: Volume VI
68: Volume Title VTI
Field Abbreviation
>
> ## If we want to capture the information after the initial h3 line
> ## of the input, e.g. the rest column below which contains a
> ## description/example for each field, then capture_all_str can be
> ## used on the full input file.
> h3.fields.dt <- nc::capture_all_str(
+ no.comments,
+ h3.pattern,
+ '</h3>\n',
+ rest="(?:.*\n)+?", #exercise: get the examples.
+ "<hr />\n")
> (h3.each.field <- h3.fields.dt[, nc::capture_all_str(
+ fields.abbrevs, field.abbrev.pattern),
+ by=fields.abbrevs])
fields.abbrevs
<char>
1: Abstract (AB)
2: Copyright Information (CI)
3: Affiliation (AD)
4: Investigator Affiliation (IRAD)
5: Article Identifier (AID)
6: Author (AU)
7: Author Identifier (AUID)
8: Full Author (FAU)
9: Book Title (BTI)
10: Collection Title (CTI)
11: Comments/Corrections (See fields and field tags listed below.)
12: Conflict of Interest Statement (COIS)
13: Corporate Author (CN)
14: Create Date (CRDT)
15: Date Completed (DCOM)
16: Date Created (DA)
17: Date Last Revised (LR)
18: Date of Electronic Publication (DEP)
19: Date of Publication (DP)
20: Editor (ED) and Full Editor Name (FED)
21: Editor (ED) and Full Editor Name (FED)
22: Edition (EN)
23: Entrez Date (EDAT)
24: Gene Symbol (GS): not currently input
25: General Note (GN)
26: Grant Number (GR)
27: Investigator Name (IR) and Full Investigator Name (FIR)
28: Investigator Name (IR) and Full Investigator Name (FIR)
29: ISBN (ISBN)
30: ISSN (IS)
31: Issue (IP)
32: Journal Title Abbreviation (TA)
33: Journal Title (JT)
34: Language (LA)
35: Location Identifier (LID)
36: Manuscript Identifier (MID)
37: MeSH Date (MHDA)
38: MeSH Terms (MH)
39: NLM Unique ID (JID)
40: Number of References (RF)
41: Other Abstract (OAB)
42: Other Copyright Information (OCI)
43: Other ID (OID)
44: Other Term (OT)
45: Other Term Owner (OTO)
46: Owner (OWN)
47: Pagination (PG)
48: Personal Name as Subject (PS)
49: Full Personal Name as Subject (FPS)
50: Place of Publication (PL)
51: Publication History Status (PHST)
52: Publication Status (PST)
53: Publication Type (PT)
54: Publishing Model (PUBM)
55: PubMed Central Identifer (PMC)
56: PubMed Central Release (PMCR)
57: PubMed Unique Identifier (PMID)
58: Registry Number/EC Number (RN)
59: Substance Name (NM)
60: Secondary Source ID (SI)
61: Source (SO)
62: Space Flight Mission (SFM)
63: Status (STAT)
64: Subset (SB)
65: Title (TI)
66: Transliterated Title (TT)
67: Volume (VI)
68: Volume Title (VTI)
fields.abbrevs
Field Abbreviation
<char> <char>
1: Abstract AB
2: Copyright Information CI
3: Affiliation AD
4: Investigator Affiliation IRAD
5: Article Identifier AID
6: Author AU
7: Author Identifier AUID
8: Full Author FAU
9: Book Title BTI
10: Collection Title CTI
11: Comments/Corrections See fields and field tags listed below.
12: Conflict of Interest Statement COIS
13: Corporate Author CN
14: Create Date CRDT
15: Date Completed DCOM
16: Date Created DA
17: Date Last Revised LR
18: Date of Electronic Publication DEP
19: Date of Publication DP
20: Editor ED
21: Full Editor Name FED
22: Edition EN
23: Entrez Date EDAT
24: Gene Symbol GS
25: General Note GN
26: Grant Number GR
27: Investigator Name IR
28: Full Investigator Name FIR
29: ISBN ISBN
30: ISSN IS
31: Issue IP
32: Journal Title Abbreviation TA
33: Journal Title JT
34: Language LA
35: Location Identifier LID
36: Manuscript Identifier MID
37: MeSH Date MHDA
38: MeSH Terms MH
39: NLM Unique ID JID
40: Number of References RF
41: Other Abstract OAB
42: Other Copyright Information OCI
43: Other ID OID
44: Other Term OT
45: Other Term Owner OTO
46: Owner OWN
47: Pagination PG
48: Personal Name as Subject PS
49: Full Personal Name as Subject FPS
50: Place of Publication PL
51: Publication History Status PHST
52: Publication Status PST
53: Publication Type PT
54: Publishing Model PUBM
55: PubMed Central Identifer PMC
56: PubMed Central Release PMCR
57: PubMed Unique Identifier PMID
58: Registry Number/EC Number RN
59: Substance Name NM
60: Secondary Source ID SI
61: Source SO
62: Space Flight Mission SFM
63: Status STAT
64: Subset SB
65: Title TI
66: Transliterated Title TT
67: Volume VI
68: Volume Title VTI
Field Abbreviation
>
> ## Either method of capturing abbreviations gives the same result.
> identical(first.each.field, h3.each.field)
[1] TRUE
>
> ## but the capture_all_str method returns the additional rest column
> ## which contains data after the initial h3 line.
> names(first.fields.dt)
[1] "name" "fields.abbrevs"
> names(h3.fields.dt)
[1] "name" "fields.abbrevs" "rest"
> cat(h3.fields.dt[fields.abbrevs=="Volume (VI)", rest])
<p>The volume number of the journal in which the article was published is recorded here.</p>
<p class="examplekm">Examples:<br />VI - 7<br />VI - 5 Spec No<br />VI - 49 Suppl 20</p>
<p>Some records (especially records from <a href="/databases/databases_oldmedline.html">OLDMEDLINE</a>) contain the Issue field but lack the Volume field; some contain the Volume field but lack the Issue field; and some records contain Volume and Issue data in the Volume element.</p>
>
> ## There are 66 Field rows across three tables.
> a.href <- list('<a href=[^>]+>')
> (td.vec <- fields.vec[240:280])
[1] "<td><a href=\"#ab\">Abstract</a></td>"
[2] "<td><a href=\"#ab\">(AB)</a></td>"
[3] "</tr>"
[4] "<tr style=\"background-color: #cccccc;\">"
[5] "<td><a href=\"#ci\">Copyright Information</a></td>"
[6] "<td>"
[7] "<div><a href=\"#ci\">(CI)</a></div>"
[8] "</td>"
[9] "</tr>"
[10] "<tr>"
[11] "<td><a href=\"#ad\">Affiliation</a></td>"
[12] "<td>"
[13] "<div><a href=\"#ad\">(AD)</a></div>"
[14] "</td>"
[15] "</tr>"
[16] "<tr style=\"background-color: #cccccc;\">"
[17] "<td><a href=\"#irad\">Investigator Affiliation</a></td>"
[18] "<td>"
[19] "<div><a href=\"#irad\">(IRAD)</a></div>"
[20] "</td>"
[21] "</tr>"
[22] "<tr>"
[23] "<td><a href=\"#aid\">Article Identifier</a></td>"
[24] "<td>"
[25] "<div><a href=\"#aid\">(AID)</a></div>"
[26] "</td>"
[27] "</tr>"
[28] "<tr style=\"background-color: #cccccc;\">"
[29] "<td><a href=\"#au\">Author</a></td>"
[30] "<td>"
[31] "<div><a href=\"#au\">(AU)</a></div>"
[32] "</td>"
[33] "</tr>"
[34] "<tr>"
[35] "<td><a href=\"#auid\">Author Identifier</a></td>"
[36] "<td><a href=\"#auid\">(AUID)</a></td>"
[37] "</tr>"
[38] "<tr>"
[39] "<td style=\"background-color: #cccccc;\"><a href=\"#fau\">Full Author</a></td>"
[40] "<td style=\"background-color: #cccccc;\">"
[41] "<div><a href=\"#fau\">(FAU)</a></div>"
> fields.pattern <- list(
+ "<td.*?>",
+ a.href,
+ Fields="[^()<]+",
+ "</a></td>\n")
> (td.only.Fields <- nc::capture_all_str(fields.vec, fields.pattern))
Fields
<char>
1: Abstract
2: Copyright Information
3: Affiliation
4: Investigator Affiliation
5: Article Identifier
6: Author
7: Author Identifier
8: Full Author
9: Book Title
10: Collection Title
11: Comments/Corrections
12: Conflict of Interest Statement
13: Corporate Author
14: Create Date
15: Date Completed
16: Date Created
17: Date Last Revised
18: Date of Electronic Publication
19: Date of Publication
20: Edition
21: Editor and Full Editor Name
22: Entrez Date
23: Gene Symbol
24: General Note
25: Grant Number
26: Investigator Name and Full Investigator Name
27: ISBN
28: ISSN
29: Issue
30: Journal Title Abbreviation
31: Journal Title
32: Language
33: Location Identifier
34: Manuscript Identifier
35: MeSH Date
36: MeSH Terms
37: NLM Unique ID
38: Number of References
39: Other Abstract
40: Other Copyright Information
41: Other ID
42: Other Term
43: Other Term Owner
44: Owner
45: Pagination
46: Personal Name as Subject
47: Full Personal Name as Subject
48: Place of Publication
49: Publication History Status
50: Publication Status
51: Publication Type
52: Publishing Model
53: PubMed Central Identifier
54: PubMed Central Release
55: PubMed Unique Identifier
56: Registry Number/EC Number
57: Substance Name
58: Secondary Source ID
59: Source
60: Space Flight Mission
61: Status
62: Subset
63: Title
64: Transliterated Title
65: Volume
66: Volume Title
Fields
>
> ## Extract Fields and Abbreviations. Careful: most fields have one
> ## abbreviation, but one field has none, and two fields have two.
> (td.fields.dt <- nc::capture_all_str(
+ fields.vec,
+ fields.pattern,
+ "<td[^>]*>",
+ "(?:\n<div>)?",
+ a.href, "?",
+ abbrevs=".*?",
+ "</"))
Fields abbrevs
<char> <char>
1: Abstract (AB)
2: Copyright Information (CI)
3: Affiliation (AD)
4: Investigator Affiliation (IRAD)
5: Article Identifier (AID)
6: Author (AU)
7: Author Identifier (AUID)
8: Full Author (FAU)
9: Book Title (BTI)
10: Collection Title (CTI)
11: Comments/Corrections
12: Conflict of Interest Statement (COIS)
13: Corporate Author (CN)
14: Create Date (CRDT)
15: Date Completed (DCOM)
16: Date Created (DA)
17: Date Last Revised (LR)
18: Date of Electronic Publication (DEP)
19: Date of Publication (DP)
20: Edition (EN)
21: Editor and Full Editor Name (ED)<br />(FED)
22: Entrez Date (EDAT)
23: Gene Symbol (GS)
24: General Note (GN)
25: Grant Number (GR)
26: Investigator Name and Full Investigator Name (IR) (FIR)
27: ISBN (ISBN)
28: ISSN (IS)
29: Issue (IP)
30: Journal Title Abbreviation (TA)
31: Journal Title (JT)
32: Language (LA)
33: Location Identifier (LID)
34: Manuscript Identifier (MID)
35: MeSH Date (MHDA)
36: MeSH Terms (MH)
37: NLM Unique ID (JID)
38: Number of References (RF)
39: Other Abstract (OAB)
40: Other Copyright Information (OCI)
41: Other ID (OID)
42: Other Term (OT)
43: Other Term Owner (OTO)
44: Owner (OWN)
45: Pagination (PG)
46: Personal Name as Subject (PS)
47: Full Personal Name as Subject (FPS)
48: Place of Publication (PL)
49: Publication History Status (PHST)
50: Publication Status (PST)
51: Publication Type (PT)
52: Publishing Model (PUBM)
53: PubMed Central Identifier (PMC)
54: PubMed Central Release (PMCR)
55: PubMed Unique Identifier (PMID)
56: Registry Number/EC Number (RN)
57: Substance Name (NM)
58: Secondary Source ID (SI)
59: Source (SO)
60: Space Flight Mission (SFM)
61: Status (STAT)
62: Subset (SB)
63: Title (TI)
64: Transliterated Title (TT)
65: Volume (VI)
66: Volume Title (VTI)
Fields abbrevs
>
> ## Get each individual abbreviation from the previously captured td
> ## data.
> td.each.field <- td.fields.dt[, {
+ f <- nc::capture_all_str(
+ Fields,
+ Field=".*?",
+ "(?:$| and )")
+ a <- nc::capture_all_str(
+ abbrevs,
+ "\\(",
+ Abbreviation="[^)]+",
+ "\\)")
+ if(nrow(a)==0)list() else cbind(f, a)
+ }, by=Fields]
> str(td.each.field)
Classes ‘data.table’ and 'data.frame': 67 obs. of 3 variables:
$ Fields : chr "Abstract" "Copyright Information" "Affiliation" "Investigator Affiliation" ...
$ Field : chr "Abstract" "Copyright Information" "Affiliation" "Investigator Affiliation" ...
$ Abbreviation: chr "AB" "CI" "AD" "IRAD" ...
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
> td.each.field[td.fields.dt, .(
+ count=.N
+ ), on=.(Fields), by=.EACHI][order(count)]
Fields count
<char> <int>
1: Comments/Corrections 0
2: Abstract 1
3: Copyright Information 1
4: Affiliation 1
5: Investigator Affiliation 1
6: Article Identifier 1
7: Author 1
8: Author Identifier 1
9: Full Author 1
10: Book Title 1
11: Collection Title 1
12: Conflict of Interest Statement 1
13: Corporate Author 1
14: Create Date 1
15: Date Completed 1
16: Date Created 1
17: Date Last Revised 1
18: Date of Electronic Publication 1
19: Date of Publication 1
20: Edition 1
21: Entrez Date 1
22: Gene Symbol 1
23: General Note 1
24: Grant Number 1
25: ISBN 1
26: ISSN 1
27: Issue 1
28: Journal Title Abbreviation 1
29: Journal Title 1
30: Language 1
31: Location Identifier 1
32: Manuscript Identifier 1
33: MeSH Date 1
34: MeSH Terms 1
35: NLM Unique ID 1
36: Number of References 1
37: Other Abstract 1
38: Other Copyright Information 1
39: Other ID 1
40: Other Term 1
41: Other Term Owner 1
42: Owner 1
43: Pagination 1
44: Personal Name as Subject 1
45: Full Personal Name as Subject 1
46: Place of Publication 1
47: Publication History Status 1
48: Publication Status 1
49: Publication Type 1
50: Publishing Model 1
51: PubMed Central Identifier 1
52: PubMed Central Release 1
53: PubMed Unique Identifier 1
54: Registry Number/EC Number 1
55: Substance Name 1
56: Secondary Source ID 1
57: Source 1
58: Space Flight Mission 1
59: Status 1
60: Subset 1
61: Title 1
62: Transliterated Title 1
63: Volume 1
64: Volume Title 1
65: Editor and Full Editor Name 2
66: Investigator Name and Full Investigator Name 2
Fields count
>
> ## There is a typo in the data captured from the h3 headings.
> td.each.field[!Field %in% h3.each.field$Field]
Fields Field Abbreviation
<char> <char> <char>
1: PubMed Central Identifier PubMed Central Identifier PMC
> h3.each.field[!Field %in% td.each.field$Field]
fields.abbrevs
<char>
1: Comments/Corrections (See fields and field tags listed below.)
2: PubMed Central Identifer (PMC)
Field Abbreviation
<char> <char>
1: Comments/Corrections See fields and field tags listed below.
2: PubMed Central Identifer PMC
>
> ## Abbreviations are consistent.
> td.each.field[!Abbreviation %in% h3.each.field$Abbreviation]
Empty data.table (0 rows and 3 cols): Fields,Field,Abbreviation
> h3.each.field[!Abbreviation %in% td.each.field$Abbreviation]
fields.abbrevs
<char>
1: Comments/Corrections (See fields and field tags listed below.)
Field Abbreviation
<char> <char>
1: Comments/Corrections See fields and field tags listed below.
>
> ## There is a table that provides a description of each comment
> ## type.
> (comment.vec <- fields.vec[840:860])
[1] "<tr>"
[2] "<th><strong>Comment or Correction Type</strong></th>"
[3] "<th><strong>MEDLINE Display Field Tag</strong></th>"
[4] "<th><strong>Description</strong></th>"
[5] "</tr>"
[6] "<tr>"
[7] "<td><strong>Comment in</strong></td>"
[8] "<td><strong>(CIN)</strong></td>"
[9] "<td>cites the reference containing a commentary about the article (appears on citation for original article); began use with journal issues published in 1989.</td>"
[10] "</tr>"
[11] "<tr>"
[12] "<td><strong>Comment on</strong></td>"
[13] "<td><strong>(CON)</strong></td>"
[14] "<td>cites the reference upon which the article comments; began use with journal issues published in 1989.</td>"
[15] "</tr>"
[16] "<tr>"
[17] "<td><strong>Erratum in</strong></td>"
[18] "<td><strong>(EIN)</strong></td>"
[19] "<td>cites a published erratum to the article (appears on citation for original article); began use in 1987.</td>"
[20] "</tr>"
[21] "<tr>"
> comment.dt <- nc::capture_all_str(
+ fields.vec,
+ "<td><strong>",
+ Field="[^<]+",
+ "</strong></td>\n",
+ "<td><strong>\\(",
+ Abbreviation="[^)]+",
+ "\\)</strong></td>\n",
+ "<td>",
+ description=".*",
+ "</td>\n")
> str(comment.dt)
Classes ‘data.table’ and 'data.frame': 18 obs. of 3 variables:
$ Field : chr "Comment in" "Comment on" "Erratum in" "Erratum for" ...
$ Abbreviation: chr "CIN" "CON" "EIN" "EFR" ...
$ description : chr "cites the reference containing a commentary about the article (appears on citation for original article); began"| __truncated__ "cites the reference upon which the article comments; began use with journal issues published in 1989." "cites a published erratum to the article (appears on citation for original article); began use in 1987." "cites the original article for which there is a published erratum. As of 2016, partial retractions are considered errata." ...
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
>
> ## Join to original PMC citation file in order to see what the
> ## abbreviations used in that file mean.
> all.abbrevs <- rbind(
+ td.each.field[, .(Field, Abbreviation)],
+ comment.dt[, .(Field, Abbreviation)])
> all.abbrevs[pmc.dt, .(
+ Abbreviation,
+ Field,
+ value=substr(value, 1, 20)
+ ), on=.(Abbreviation)]
Abbreviation Field value
<char> <char> <char>
1: PMID PubMed Unique Identifier 21113027
2: OWN Owner NLM
3: STAT Status MEDLINE
4: DCOM Date Completed 20110512
5: LR Date Last Revised 20181113
6: IS ISSN 1362-4962 (Electroni
7: IS ISSN 0305-1048 (Print)
8: IS ISSN 0305-1048 (Linking)
9: VI Volume 39
10: IP Issue 4
11: DP Date of Publication 2011 Mar
12: TI Title A manually curated C
13: PG Pagination e25
14: LID Location Identifier 10.1093/nar/gkq1187
15: AB Abstract Chromatin immunoprec
16: FAU Full Author Rye, Morten Beck
17: AU Author Rye MB
18: AD Affiliation Department of Cancer
19: FAU Full Author Sætrom, Pål
20: AU Author Sætrom P
21: FAU Full Author Drabløs, Finn
22: AU Author Drabløs F
23: LA Language eng
24: PT Publication Type Evaluation Studies
25: PT Publication Type Journal Article
26: PT Publication Type Research Support, No
27: DEP Date of Electronic Publication 20101126
28: TA Journal Title Abbreviation Nucleic Acids Res
29: JT Journal Title Nucleic acids resear
30: JID NLM Unique ID 0411011
31: RN Registry Number/EC Number 0 (Transcription Fac
32: SB Subset IM
33: MH MeSH Terms Benchmarking
34: MH MeSH Terms Binding Sites
35: MH MeSH Terms *Chromatin Immunopre
36: MH MeSH Terms *High-Throughput Nuc
37: MH MeSH Terms *Software
38: MH MeSH Terms Transcription Factor
39: PMC PubMed Central Identifier PMC3045577
40: EDAT Entrez Date 2010/11/30 06:00
41: MHDA MeSH Date 2011/05/13 06:00
42: CRDT Create Date 2010/11/30 06:00
43: PHST Publication History Status 2010/11/30 06:00 [en
44: PHST Publication History Status 2010/11/30 06:00 [pu
45: PHST Publication History Status 2011/05/13 06:00 [me
46: AID Article Identifier 10.1093/nar/gkq1187
47: AID Article Identifier gkq1187 [pii]
48: AID Article Identifier gkq1187 [pii]
49: PST Publication Status ppublish
50: SO Source Nucleic Acids Res. 2
Abbreviation Field value
>
> ## There is a listing of examples for each comment type.
> (comment.ex.dt <- nc::capture_all_str(
+ fields.vec[938],
+ "br />\\s*",
+ Abbreviation="[A-Z]+",
+ "\\s*-\\s*",
+ citation="[^<]+?",
+ list(
+ "[.] ",
+ nc::field("PMID", ": ", "[0-9]+")
+ ), "?",
+ "<"))
Abbreviation citation
<char> <char>
1: CON Dev Cell. 2002 Jul;3(1):85-97
2: CIN N Engl J Med. 2003 Jul 17;349(3):211-2
3: CRI Orthop Nurs. 2003 May-Jun;22(3):232-9
4: CRF Biochemistry. 1994 May 10;33(18):5614-22
5: EIN Acta Obstet Gynecol Scand. 2003 Jan;82(1):102
6: EFR J Arthroplasty. 2002 Jun;17(4):524-6
7: RIN J Biochem Mol Biol. 2002 Nov 30;35(6):642
8: ROF Ware FE, Lehrman MA. J Biol Chem. 1996 Jun 14;271(24):13935-8
9: UIN Cochrane Database Syst Rev. 2002;(3):CD003688
10: UOF Cochrane Database Syst Rev. 2002;(2):CD003680
11: SPIN Ann Intern Med. 2003 Jun 3;138(11):I60
12: ORI Ann Intern Med. 2003 Jun 3;138(11):907-16
PMID
<char>
1: 12110170
2: 12867604
3: 12872752
4: 8180186
5:
6: 12066289
7: 12476908
8: 8663248
9: 12137706
10: 12076500
11: 12779314
12: 12779301
>
> ## Join abbreviations to see what kind of comments.
> all.abbrevs[comment.ex.dt, on=.(Abbreviation)]
Field Abbreviation
<char> <char>
1: Comment on CON
2: Comment in CIN
3: Corrected and Republished in CRI
4: Corrected and Republished from CRF
5: Erratum in EIN
6: Erratum for EFR
7: Retraction in RIN
8: Retraction of ROF
9: Update in UIN
10: Update of UOF
11: Summary for patients in SPIN
12: Original report in ORI
citation PMID
<char> <char>
1: Dev Cell. 2002 Jul;3(1):85-97 12110170
2: N Engl J Med. 2003 Jul 17;349(3):211-2 12867604
3: Orthop Nurs. 2003 May-Jun;22(3):232-9 12872752
4: Biochemistry. 1994 May 10;33(18):5614-22 8180186
5: Acta Obstet Gynecol Scand. 2003 Jan;82(1):102
6: J Arthroplasty. 2002 Jun;17(4):524-6 12066289
7: J Biochem Mol Biol. 2002 Nov 30;35(6):642 12476908
8: Ware FE, Lehrman MA. J Biol Chem. 1996 Jun 14;271(24):13935-8 8663248
9: Cochrane Database Syst Rev. 2002;(3):CD003688 12137706
10: Cochrane Database Syst Rev. 2002;(2):CD003680 12076500
11: Ann Intern Med. 2003 Jun 3;138(11):I60 12779314
12: Ann Intern Med. 2003 Jun 3;138(11):907-16 12779301
>
> ## parsing bibtex file.
> refs.bib <- system.file(
+ "extdata", "namedCapture-refs.bib", package="nc")
> refs.vec <- readLines(refs.bib)
> at.lines <- grep("@", refs.vec, value=TRUE)
> str(at.lines)
chr [1:24] " @Manual{namedCapture," " @Manual{TRE," " @Manual{re2r," ...
> refs.dt <- nc::capture_all_str(
+ refs.vec,
+ "@",
+ type="[^{]+",
+ "[{]",
+ ref="[^,]+",
+ ",\n",
+ fields="(?:.*\n)+?.*",
+ "[}]\\s*(?:$|\n)")
> str(refs.dt)
Classes ‘data.table’ and 'data.frame': 24 obs. of 3 variables:
$ type : chr "Manual" "Manual" "Manual" "Manual" ...
$ ref : chr "namedCapture" "TRE" "re2r" "rematch2" ...
$ fields: chr " title = {namedCapture: Named Capture Regular Expressions},\n author = {Toby Dylan Hocking},\n year = "| __truncated__ " title = {TRE: The free and portable approximate regex matching library},\n author = {Ville Laurikari},\n"| __truncated__ " title = {re2r: RE2 Regular Expression},\n author = {Qin Wenfeng},\n year = {2017},\n note = {R pac"| __truncated__ " title = {rematch2: Tidy Output from Regular Expression Matching},\n author = {Gábor Csárdi},\n year ="| __truncated__ ...
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
>
> ## parsing each field of each entry.
> eq.lines <- grep("=", refs.vec, value=TRUE)
> str(eq.lines)
chr [1:140] " title = {namedCapture: Named Capture Regular Expressions}," ...
> strip <- function(x)sub("^\\s*\\{*", "", sub("\\}*,?$", "", x))
> refs.fields <- refs.dt[, nc::capture_all_str(
+ fields,
+ "\\s+",
+ variable="\\S+",
+ "\\s+=",
+ value=".*", strip),
+ by=.(type, ref)]
> str(refs.fields)
Classes ‘data.table’ and 'data.frame': 140 obs. of 4 variables:
$ type : chr "Manual" "Manual" "Manual" "Manual" ...
$ ref : chr "namedCapture" "namedCapture" "namedCapture" "namedCapture" ...
$ variable: chr "title" "author" "year" "note" ...
$ value : chr "namedCapture: Named Capture Regular Expressions" "Toby Dylan Hocking" "2019" "R package version 2019.01.14" ...
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
> with(refs.fields[ref=="HockingUseR2011"], structure(
+ as.list(value), names=variable))
$author
[1] "Toby Dylan Hocking"
$title
[1] "Fast, named capture regular expressions in R 2.14"
$year
[1] "2011"
$url
[1] "http://web.warwick.ac.uk/statsdept/user-2011/TalkSlides/Lightening/2-StatisticsAndProg\\_3-Hocking.pdf"
$booktitle
[1] "useR 2011 conference proceedings"
> ## the URL of my talk is now
> ## https://user2011.r-project.org/TalkSlides/Lightening/2-StatisticsAndProg_3-Hocking.pdf
>
> if(!grepl("solaris", R.version$platform)){#To avoid CRAN check error on solaris
+ ## Parsing wikimedia tables: each begins with {| and ends with |}.
+ emoji.txt.gz <- system.file(
+ "extdata", "wikipedia-emoji-text.txt.gz", package="nc")
+ tables <- nc::capture_all_str(
+ emoji.txt.gz,
+ "\n[{][|]",
+ first=".*",
+ '\n[|][+] style="',
+ nc::field("font-size", ":", '.*?'),
+ '" [|] ',
+ title=".*",
+ lines="(?:\n.*)*?",
+ "\n[|][}]")
+ str(tables)
+ ## Rows are separated by |-
+ rows.dt <- tables[, {
+ row.vec <- strsplit(lines, "|-", fixed=TRUE)[[1]][-1]
+ .(row.i=seq_along(row.vec), row=row.vec)
+ }, by=title]
+ str(rows.dt)
+ ## Try to parse columns from each row. Doesn't work for second table
+ ## https://en.wikipedia.org/w/index.php?title=Emoji&oldid=920745513#Skin_color
+ ## because some entries have rowspan=2.
+ contents.dt <- rows.dt[, nc::capture_all_str(
+ row,
+ "[|] ",
+ content=".*?",
+ "(?: [|]|\n|$)"),
+ by=.(title, row.i)]
+ contents.dt[, .(cols=.N), by=.(title, row.i)]
+ ## Make data table from
+ ## https://en.wikipedia.org/w/index.php?title=Emoji&oldid=920745513#Emoji_versus_text_presentation
+ contents.dt[, col.i := 1:.N, by=.(title, row.i)]
+ data.table::dcast(
+ contents.dt[title=="Sample emoji variation sequences"],
+ row.i ~ col.i,
+ value.var="content")
+ }
Classes ‘data.table’ and 'data.frame': 2 obs. of 4 variables:
$ first : chr " border=\"1\" cellspacing=\"0\" cellpadding=\"5\" class=\"wikitable nounderlines\" style=\"border-collapse:coll"| __truncated__ " border=\"1\" cellspacing=\"0\" cellpadding=\"5\" class=\"wikitable nounderlines\" style=\"border-collapse:coll"| __truncated__
$ font-size: chr " 67%" "small"
$ title : chr "Sample emoji variation sequences" "Sample use of Fitzpatrick modifiers"
$ lines : chr "\n|- style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:right\" | U+ || 2139 || 23"| __truncated__ "\n|-style=\"background:#F8F8F8;font-size:67%\"\n! scope=\"col\" colspan=\"2\" style=\"text-align:left\" | Code "| __truncated__
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
Classes ‘data.table’ and 'data.frame': 19 obs. of 3 variables:
$ title: chr "Sample emoji variation sequences" "Sample emoji variation sequences" "Sample emoji variation sequences" "Sample emoji variation sequences" ...
$ row.i: int 1 2 3 4 5 6 1 2 3 4 ...
$ row : chr " style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:right\" | U+ || 2139 || 231B |"| __truncated__ " style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:left\" | default presenta"| __truncated__ "\n! scope=\"col\" style=\"background:#F8F8F8;font-size: 67%;text-align:left\" | base code point\n| ℹ "| __truncated__ "\n! scope=\"col\" style=\"background:#F8F8F8;font-size: 67%;text-align:left\" | base+VS15 (text)\n| {{emoji pre"| __truncated__ ...
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
Error in `[.data.table`(contents.dt, , `:=`(col.i, 1:.N), by = .(title, :
attempt access index 3/3 in VECTOR_ELT
Calls: [ -> [.data.table
Execution halted
- checking for unstated dependencies in ‘tests’ ... OK
- checking tests ... [14s/16s] OK
Running ‘testthat.R’ [14s/16s]
- checking for unstated dependencies in vignettes ... OK
- checking package vignettes ... OK
- checking re-building of vignette outputs ... [32s/48s] ERROR
Error(s) in re-building vignettes:
...
--- re-building ‘v0-overview.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v0-overview.Rmd’
--- re-building ‘v1-capture-first.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v1-capture-first.Rmd’
--- re-building ‘v2-capture-all.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v2-capture-all.Rmd’
--- re-building ‘v3-capture-melt.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v3-capture-melt.Rmd’
--- re-building ‘v4-comparisons.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v4-comparisons.Rmd’
--- re-building ‘v5-helpers.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v5-helpers.Rmd’
--- re-building ‘v6-engines.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v6-engines.Rmd’
--- re-building ‘v7-capture-glob.Rmd’ using rmarkdown
Quitting from v7-capture-glob.Rmd:257-272 [unnamed-chunk-18]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<error/rlang_error>
Error in `[.data.table`:
! attempt access index 6/6 in VECTOR_ELT
---
Backtrace:
▆
1. ├─...[]
2. └─data.table:::`[.data.table`(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error: processing vignette 'v7-capture-glob.Rmd' failed with diagnostics:
attempt access index 6/6 in VECTOR_ELT
--- failed re-building ‘v7-capture-glob.Rmd’
SUMMARY: processing the following file failed:
‘v7-capture-glob.Rmd’
Error: Vignette re-building failed.
Execution halted
- checking PDF version of manual ... [4s/6s] OK
- checking HTML version of manual ... [1s/2s] OK
- checking for non-standard things in the check directory ... OK
- DONE
Status: 2 ERRORs