ENGG/ENGM 182: Data Analytics: Using Data

Data Responsibilities

Dartmouth subscribes to numerous resources, but datasets are expensive and we can't provide everything. There are also important things to remember when working with datasets:

Data is licensed

Whether through our subscriptions or in open online repositories, data is typically licensed in some form or another. These licenses restrict how, where, and by whom the data may be used, how much of the data is available for download, and how it should be referenced in the new work. Be sure to examine the metadata of the dataset for this information. You may see a Creative Commons license, but there are many types of Creative Commons license, and you should examine the rules of each license individually.

Not all data is created equal

Just because you found a dataset available for use doesn't mean it is useful. If a dataset does not have accompanying information, including complete metadata, sources of the data, and a README file informing you how to interpret the data (including what the column headers in the tabular file mean), it may be useless to you. Be sure to check for accompanying metadata and README files when selecting a dataset.

Data should be cited

Data has historically been published less often than the articles it informs, and therefore many citation formats are still grappling with how best to cite data. The important part is that you do cite the dataset. Most datasets now have a stable URL (often a DOI), and most citation styles have attempted to settle on a format for datasets. If you have questions about citing your datasets, ask your librarian.

Text Scraping & Mining in Subscribed Databases: Please Don't

While text scraping and mining are incredibly useful tools that we encourage you to use in responsible ways, please refrain from using mass download, scraping, or mining code within databases or resources subscribed to by Dartmouth Libraries. These activities are expressly prohibited by our licenses unless otherwise stated, and could result in the immediate denial of access to the researcher, as well as the termination of our license for that resource. We do not want to discourage you from practicing with these tools, however, and encourage you to reach out to our Research Data Services department or check this Text Analysis guide for assistance.

Citing Data

Data Citation Formatting
Recommended formats for data citation:

Basic data citation
     Creator (Publication Year). Title. Publisher. Identifier

Data citation with resource type and identifier
     Creator (Publication Year). Title. Version. Publisher. Resource Type. Identifier
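
If you keep your dataset details in a script or notebook, the two patterns above can be assembled automatically. The following is a minimal Python sketch, not an official tool: the function name format_data_citation and every example value (creator, title, publisher, and the DOI) are made up for illustration. Always check the result against the citation style your course or publisher requires.

```python
def format_data_citation(creator, year, title, publisher, identifier,
                         version=None, resource_type=None):
    """Assemble a data citation in the order shown in the patterns above.

    Version and resource type are optional; leaving both out gives the
    basic pattern: Creator (Publication Year). Title. Publisher. Identifier
    """
    parts = [f"{creator} ({year})", title]
    if version:
        parts.append(version)
    parts.append(publisher)
    if resource_type:
        parts.append(resource_type)
    # Drop any trailing period so joining does not produce ".." in the output.
    parts = [str(part).strip().rstrip(".") for part in parts]
    # The identifier (for example a DOI URL) goes last, with no trailing period.
    return ". ".join(parts) + ". " + identifier


# Hypothetical example values, for illustration only.
print(format_data_citation(
    "Doe, J.", 2020, "Example Survey Dataset",
    "Example Repository", "https://doi.org/10.0000/example",
    version="Version 2", resource_type="Dataset",
))
```

Leaving out version and resource_type produces the basic citation; supplying them produces the longer form.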

Interpreting Metadata

When seeking out data, students are encouraged to consider the following questions:

  • What is the source of this data?
    • Knowing who authored the dataset is important when considering the influences on data gathering, cleaning, and use.
  • How big is the file? How many lines of data should I expect?
    • If you are looking for many lines for analysis, a small file may not be big enough for the conclusions you hope to draw.
  • What filetypes are available?
    • Most tabular data will be in .csv format, but code files may be in many formats. Make sure whatever you download is in a format you can open; if it is not, the file type tells you what software you will need to use those files.
  • What accompanying READMEs or files are there?
    • Data is practically useless without information about how and when it was gathered, how it was cleaned or processed, what code should be or has been used to process it, what the tabular labels mean (columns are often numbered or abbreviated for ease of use in the original context), and whether any of the data has been inferred or all of it is cleaned but otherwise unchanged.
  • What is the use license?
    • Checking how the data can be used before you start ensures you don't work with a restricted dataset.
  • How much data am I allowed to download at once?
    • Most resources will allow full downloads of the datasets, but there are some subscribed resources that restrict the number of lines you can download at once. We try to state on resource summaries if there is a known limit. If you run into sudden failed downloads, you may have hit an unstated limit. Please contact a librarian for clarification or assistance.
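
Several of the questions above (file size, number of lines, column headers, accompanying READMEs) can be answered quickly once a dataset is on your computer. The following is a minimal Python sketch under stated assumptions: the file is a UTF-8 encoded .csv whose first row holds the column headers, and the function names summarize_csv and find_readmes are hypothetical helpers written for this guide, not part of any library. Only run it on files your license already allows you to download.

```python
import csv
import os


def summarize_csv(path):
    """Report the size, data-row count, and column headers of a CSV file.

    Assumes the file is UTF-8 encoded and its first row is a header row.
    """
    size_bytes = os.path.getsize(path)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        headers = next(reader, [])          # first row: column headers
        row_count = sum(1 for _ in reader)  # remaining rows: the data
    return {"size_bytes": size_bytes, "rows": row_count, "columns": headers}


def find_readmes(folder):
    """List files in the folder whose names start with 'readme' (any case)."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().startswith("readme")
    )
```

For example, summarize_csv("my_dataset.csv") (a made-up file name) returns a dictionary with the size in bytes, the number of data rows, and the list of column headers. If the headers are numbered or abbreviated and find_readmes returns an empty list for the download folder, that is a sign the dataset may be hard to interpret.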