

Batch Uploading ETDs to ScholarWorks

Transform the MARCXML file of short-version bib records plus Internet Archive (IA) URLs to Bepress XML:

Preparation:

Masters theses should be in a separate file from doctoral dissertations.

Remove diacritics!
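One possible way to do this outside MarcEdit is sketched below in Python. It is only an illustration, not part of the original workflow: the function names are hypothetical, and it handles only accented letters that Unicode can decompose into a base letter plus a combining mark. Characters with no such decomposition (for example "ø") are left alone, so the second helper flags anything non-ASCII that still needs a manual fix.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (for example 'é' becomes 'e')."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn) and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def has_non_ascii(text: str) -> bool:
    """Report whether any character outside plain ASCII remains."""
    return any(ord(ch) > 127 for ch in text)
```

Running `has_non_ascii` on each title after `strip_diacritics` gives a short list of records to check by hand before the conversion.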

Short versions used for metadata export should include the following fields:

=LDR  01608nam a2200361K  4500
=100  1\$aSmith, Philip H.
=245  10$aTypes of corn suited to Massachusetts conditions
=260  \\$c1911.
=502  \\$aThesis (M.S.)--Massachusetts Agricultural College, 1911.
=650  \0$aCorn$zMassachusetts.
=690  \\$aTheses$xHorticulture$xMasters.
=776  $w(OCoLC)18237440
=856  41$uhttp://archive.org/details/typesofcornsuite00smit
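Before running the conversion, it may help to confirm that every record in the MARCXML file carries the fields listed above. The Python sketch below is an assumption-laden illustration rather than part of etd2bepress: it assumes the file uses the standard MARC21 "slim" namespace, and the function name and the choice of which tags to require are hypothetical (trim `REQUIRED_TAGS` if, say, some records legitimately lack a 650).

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace; assumed to be what the exported file uses.
MARC_NS = "http://www.loc.gov/MARC21/slim"

# Tags taken from the sample short record above (leader excluded).
REQUIRED_TAGS = ("100", "245", "260", "502", "650", "690", "776", "856")


def missing_fields(marcxml_text, required=REQUIRED_TAGS):
    """Return (record position, missing tags) for each incomplete record."""
    root = ET.fromstring(marcxml_text)
    problems = []
    # iter() also matches the root itself, so a lone <record> works too.
    for position, record in enumerate(root.iter(f"{{{MARC_NS}}}record"), start=1):
        present = {field.get("tag") for field in record.findall(f"{{{MARC_NS}}}datafield")}
        missing = [tag for tag in required if tag not in present]
        if missing:
            problems.append((position, missing))
    return problems
```

An empty list means every record has all of the required tags.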

Creating the Bepress spreadsheet:

  1. Save the .xml file as “marc.xml” in the metadata 2 folder on my U: drive. NOTE: Only the etd2bepress.exe program and the current file (saved as marc.xml) should be in this folder. When finished, delete file along with created bpimport spreadsheet.
  2. Go to W:\ETD Digitization Project\ETD\command line instructions for running Aaron’s program.txt. Move it to the 2nd computer screen for cutting and pasting.
  3. Go to Windows start button and type cmd into Search programs and files box. Hit Enter.
  4. On the command line, type: u:
  5. Cut and paste: cd "metadata 2" from Aaron’s instructions. (NOTE: Use right-click instead of shortcut keys.)
  6. Cut and paste: etd2bepress marc.xml from Aaron’s instructions.

This should create a spreadsheet in the metadata 2 folder: bpimport_date. Rename it with the date and Masters/PhD, and save it to a personal folder.

Make adjustments to Bepress spreadsheet

  1. Fix spaces in URLs before .pdf and before / (Watch out for titles ending in /)
  2. Change double colons in the titles to single colons. :: = :
  3. Find all opt-outs pertaining to the list(s) being worked on, and enter “campus” in document-type column.
  4. Masters: All other entries will have “open” in this column
  5. PhD’s: All other entries will have “dissertation” in this column.
  6. Masters: Correct the Department column heading to degree_prog
  7. Keyword fix: Go to the scholmasters or scolphd file in MarcEdit and download the 600, 610, 650 and 651 fields with all subheadings to an Excel spreadsheet. Combine the lines for each record, and remove periods, $x, $y and $z. Remove the comma between subject headings and their subdivisions, retaining commas only between separate entries. Remove parentheses. Run Compare line differences on the 776 value to be sure the MarcEdit and bpimport rows match, then paste the new Keyword entries into the bpimport spreadsheet.
  8. NOTE: The final file needs to be saved (and uploaded) as “Excel 97-2003 Worksheet”!!!!
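The URL, title and keyword adjustments above could in principle be scripted. The Python sketch below is a hedged illustration of those three clean-ups, not the procedure actually used: the function names are hypothetical, the subject fields are assumed to arrive as MarcEdit-style strings such as `$aCorn$zMassachusetts.`, and joining a heading to its subdivisions with a single space is an assumption about the desired keyword format.

```python
import re


def fix_url(url: str) -> str:
    """Remove stray whitespace that appears before '.pdf' or before '/'."""
    # Apply this to the URL column only, so titles ending in ' /' are untouched.
    return re.sub(r"\s+(?=\.pdf|/)", "", url, flags=re.IGNORECASE)


def fix_title(title: str) -> str:
    """Collapse a double colon into a single colon."""
    return title.replace("::", ":")


def clean_keywords(subject_fields) -> str:
    """Build one keyword string from the subject fields of a single record."""
    entries = []
    for field in subject_fields:
        # Split on subfield delimiters such as $a, $x, $y and $z.
        parts = re.split(r"\$[a-z0-9]", field)
        cleaned = []
        for part in parts:
            # Remove periods and parentheses, then trim surrounding spaces.
            part = re.sub(r"[.()]", "", part).strip()
            if part:
                cleaned.append(part)
        if cleaned:
            # Heading and subdivisions are joined by a space, not a comma.
            entries.append(" ".join(cleaned))
    # Commas separate only the distinct subject entries.
    return ", ".join(entries)
```

Note that removing every period also shortens abbreviations inside a heading, so the results are worth a quick read before pasting them into the Keyword column.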

Upload Bepress spreadsheet to the appropriate series in Scholarworks:

Bepress Doctoral Dissertations 1911-2013 series

Bepress Masters Theses 1896 - February 2014 series

  1. Search Scholarworks site and click it open. Click My Account (in upper right corner). Log in with email address and password.
  2. Click Manage Theses or Manage Dissertations, “Batch upload Excel” (on left-side column).
  3. Click Browse and find the correct file. Open it. Hit Upload. Hit Update.

Wait for confirming email

  1. First message: Masters Theses (series)/Doctoral Dissertations (series) queued update complete.
  2. Second message: Import results for Masters Theses/Doctoral Dissertations. This is the important one! It will list the successfully imported records or report errors. If the upload succeeded, click the link in the email to access Scholarworks, then click “Update Site” (on the left margin of the screen). If there are errors, the problems should be displayed; look at the bottom of the message to find the one or few records involved. A Unicode error means unaccepted diacritics or other “funky” characters.
  3. Third message: Masters Theses/Doctoral Dissertations queued update complete.
  4. Return to Scholarworks site and search/check a couple of the items to see if the PDF looks OK. Can also paste the URL into my browser.

Combining 2-part ETDs in Scholarworks

  1. Search the Scholarworks site for any Masters theses or PhD dissertations that originally included URLs for two separate volumes. See Batch Conversion of ETDs to e-records with Scholarworks URLs, Insert the Internet Archive URLs from the pick list, NOTE RE duplicate Sys#'s.
  2. TIP: Since scanned items in Scholarworks are difficult to navigate from front to back, it's best to use the Internet Archive to determine how the 300 field should be constructed in the e-version of a 2-part ETD, before combining the two URLs.
  3. Copy the two I.A. URLs to be combined (one at a time) and enter each into the browser's address bar to call up the scanned item from Internet Archive. When the item appears, go to the window in the bottom right labeled DOWNLOAD OPTIONS.
  4. Scroll to PDF and click it. This will open the item in my browser. Go to little white “down arrow” at left top of dark field, and click.
  5. At resulting window, choose: Open with Adobe Acrobat (default). Click OK.
  6. In the next window, go to File/Save As… Choose the place to save the file (my ETD folder) and save the file as a .pdf. Repeat with the second file. (Easy way to bring up the 2nd file: replace the 01’s with 02’s in the URL in the address bar.)
  7. Open Adobe Acrobat (in Programs). Click File (in top task bar)/Create/Combine files into a single PDF.
  8. Click Add Files/Add Files. Select files and Open. NOTE: Be sure to put the first part of the ETD on the left side. Also check to make sure that 01 is actually Part 1 of the Dissertation; in one of my examples, the parts were reversed.
  9. Click Combine Files. A green bar will appear. Go to File/Save As … Call it a new name as a .pdf.

Generate spreadsheet including SW URLs for e-conversions

  1. From the Scholarworks My Account page, click Manage Theses/Dissertations under the appropriate series.
  2. Click Batch revise Excel (on left margin), then click “Generate” in the box under “To revise content via an Excel spreadsheet.“ This will take a while, but when done, will add the downloaded file under today’s date, in the list at the bottom. Click Download and it will automatically open it in Excel.
  3. Save this generated spreadsheet in a personal folder under a new name (e.g., [date,etc.]phd(or masters)URLs776), and delete all columns except for title, author, OCLC# and URL. NOTE: Scholarworks URLs are generated in order; check the bottom of the generated spreadsheet to find the entries matching the bpimport spreadsheet-in-process. Copy the column of OCLC#'s from the bpimport spreadsheet into the generated spreadsheet and use compare row differences to verify accuracy. Delete all non-matching rows. This produces the list needed to proceed to: Batch Conversion of ETDs to e-records with Scholarworks URLs, Upload file of longer-version records into Connexion.
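The row-matching part of step 3 can be pictured as a simple filter. The Python sketch below is illustrative only: it assumes the generated spreadsheet has already been read into a list of dictionaries, and the column key `oclc` and the function name are hypothetical.

```python
def keep_matching_rows(generated_rows, bpimport_oclc_numbers):
    """Keep rows whose OCLC number is in the bpimport list; report the rest."""
    wanted = {number.strip() for number in bpimport_oclc_numbers}
    # Rows from the generated spreadsheet that belong to the current batch.
    kept = [row for row in generated_rows if row["oclc"].strip() in wanted]
    # bpimport numbers with no row in the generated spreadsheet.
    found = {row["oclc"].strip() for row in kept}
    unmatched = sorted(wanted - found)
    return kept, unmatched
```

A non-empty `unmatched` list would point to records in the batch that did not receive a Scholarworks URL and need a closer look.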

Notify Meghan, Lisa and Jessica Adamick

  1. Create an “upload report” from the bpimport spreadsheet with Title, URL (pasted in from the previous URLs776 spreadsheet), Authors, M/P, and Program included.
  2. Send an email notification with upload report spreadsheet attached.

Troubleshooting notes:

Check the upload date from an Administrator Report on the series: http://scholarworks.umass.edu/cgi/editor.cgi?window=report&context=dissertations_1

The error message will include a list of titles. The problem record is usually the one after the last title listed in the error.

Unicode errors: These can result from problem diacritics or symbols in the original records, carried over into MARCXML. Check the MarcEdit “mb” file for the character(s) associated with the titles in the error message and remove or replace them, then reconvert to MARC and MARCXML and re-run the Bepress conversion.

Internet Archive errors: These can result from problems with the Internet Archive links, for example if permissions are lacking, or if some entries are duplicates of previously scanned ETDs. Check the TOTALS folder under ETD Digitization Project on the W: Drive for duplicates; check the Internet Archive linkage; if necessary, contact Tim Bigelow of Internet Archive to resolve the error.

Adding “OPT-OUTS” TO Scholarworks Metadata Spreadsheets

NOTE: The Opt-in/Opt-out information will be found in the ETD Master File, saved on the W: Drive under ETD Digitization Project, in Sheet 2, “all titles,” Column R, “Authors Permissions Response: Opt-in, Opt-out, no response.”

  1. Make a new copy of Sheet 2, retaining only columns R (Opt-ins/Opt-outs) and I (Advance ID). Make sure the data is connected, with no blank columns between.
  2. Make a new copy of Sheet 1 entitled “all items,” retaining only columns E (Bib Nbr), F (OCLC Nbr) and P (Advance ID), making sure the data is connected.
  3. Combine the two copies with blank columns between. Sort by Advance ID and compare row differences. When match is established, delete 1 column of Advance ID information, and connect the rest of the columns.
  4. Sort by R, and delete all rows which do not contain “Opt-out” NOTE: To prevent Excel from freezing up, delete as much extraneous information as possible, before working any function which requires Paste Values.
  5. Produce a list of 776's from the bpimport spreadsheets for Masters and PhDs. Copy to the right side of the worksheet containing the Opt-outs, leaving a few blank columns between. This will be the short list. Enter “match” all the way down the column to right of the short list, and work a VLOOKUP, comparing to the long OCLC list.
  6. VLOOKUP breakdown: Place the cursor in the cell just to the right of the block of information on the right side of the spreadsheet, and enter the formula as follows: =VLOOKUP(f2:f274,$i$2:$j$70,2,false), using either upper or lower case. Explanation: f2=cell of the first entry in the large list (numbered by column and row) (could be a, d, e, m, whatever), f274=cell of the last entry in that list. These are the values being looked up. $i$2=first cell of the small list, $j$70=last cell of the “match” column to the right of the small list; together they form the table that is searched. “2” tells VLOOKUP to return the value from the second column of that table (the “match” column), and “False” requires an exact match. In this case, the word “match” will appear in place of the formula when it is entered in a row with an OCLC match. If no match is found, #N/A will appear.
  7. Highlight the column in which VLOOKUP was entered, hit Control-C (copy) and under File, click Paste/Paste Values/123. This will remove the formula from the data. Sort by that column. If any matches are present, they should appear at the top, indicating any OCLC numbers in the current bpimport spreadsheets which have Opt-outs associated with them.

NOTE: Format columns to be matched in the VLOOKUP as “General” or else it won't work.
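For comparison, the same opt-out check can be expressed as a set lookup. The Python sketch below is an illustration under stated assumptions, not the documented Excel procedure: the function names are hypothetical, and OCLC numbers are reduced to bare digits first, which plays the role of the "General" formatting note above by making text and numeric values comparable.

```python
import re


def normalize_oclc(value) -> str:
    """Reduce a value such as '(OCoLC)18237440' or 18237440 to its digits."""
    if isinstance(value, float):
        value = int(value)  # spreadsheets often store whole numbers as floats
    # Keep digits only and ignore leading zeros.
    return re.sub(r"\D", "", str(value)).lstrip("0")


def assign_document_types(bpimport_oclc, opt_out_oclc, default_type):
    """Pair each bpimport OCLC value with 'campus' (opt-out) or the default."""
    opt_outs = {normalize_oclc(value) for value in opt_out_oclc}
    return [
        (value, "campus" if normalize_oclc(value) in opt_outs else default_type)
        for value in bpimport_oclc
    ]
```

Here `default_type` would be “open” for Masters and “dissertation” for PhDs, following the document-type rules given earlier on this page.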

Primary contacts: Lucy deGozzaldi, Meghan Bergin.

batch_uploading_etds_to_scholarworks.1485970097.txt.gz · Last modified: 2019/01/07 17:20 (external edit)